CKE has already run Qwen2 and Qwen3 on ARM Cortex-A72 hardware (TI TDA4VM). This is not a roadmap post. This is a “what we already ship” post. For ARM silicon teams, the important fact is simple: C-Kernel-Engine already contains production NEON kernel paths for quantized inference, and the companion architectural walkthrough is on youtube.com/@antshivrobotics.
The message to ARM is the same message the SIMD deep dive made for x86 vendors: CKE does not speak about ISA support in abstractions. It names the exact intrinsics, the exact dispatch tier, the exact parity constraint, and the exact kernels that are already hot on real hardware. CPU-first inference is the through-line. The scaling thesis is still 0×∞=0: if the model does not fit in accelerator memory, theoretical accelerator throughput collapses into a systems problem. At the Ethernet boundary, CPUs and GPUs hit the same external bottleneck.
What this post covers
The first part establishes the ARM story: real deployment on TI TDA4VM, what NEON is, and how CKE dispatches into ARM code paths today. The kernel walkthrough then covers the three existing NEON quantized kernels and the reduction patterns they use.
The second part widens the frame: what is still scalar on ARM, what SVE2 on Neoverse V3 unlocks, why bandwidth dominates decode, how compile-time dispatch fits silicon evaluation, and why CKE is an open invitation for ARM AGI CPU bring-up.
CKE Already Runs on ARM
The most important sentence in this post is also the least hypothetical one: CKE has already run Qwen2 and Qwen3 on a TI TDA4VM. That SoC is a Jacinto 7 part with dual Cortex-A72 CPU cores. In other words, the ARM story here is not “NEON would be nice someday.” It is “NEON quantized inference already happened on shipping ARMv8 silicon.”
That matters because the series is written for CPU vendors, not app developers. A developer only needs to know that a model launches. A silicon team wants to know whether the hottest loops are explicit, auditable, and ISA-mapped. In CKE, the answer is yes for the quantized GEMV path that dominates batch-1 decode.
The scaling thesis from the CKE scaling page is the frame around all of this. If the model does not fit in GPU memory, the accelerator conversation instantly becomes a multi-GPU rack conversation. That is where 0×∞=0 stops being a slogan and becomes capital expenditure. A large-memory CPU server can stay in one coherent address space, and the Theory of Constraints says the Ethernet boundary equalizes more of the system than accelerator marketing usually admits.
TDA4VM is modest by datacenter standards, which is exactly why it matters. If quantized Qwen inference works on two Cortex-A72 cores with limited memory, then the “CPU-only is impossible” argument is already broken at the small end before Neoverse-class servers even enter the room. 2 A72 cores The proof point is not peak throughput. It is that the code generation, quantization, compile, and inference stack already closes on ARM silicon.

CPU-only inference thesis:
0 × ∞ = 0
If the model does not fit in accelerator memory,
you do not have an accelerator throughput problem.
You have a system-capacity problem.
Typical implication:
- Multi-GPU memory solution: expensive, networked, complex
- Large-memory CPU server: one address space, cheaper RAM scaling
Theory of Constraints:
Once requests cross the Ethernet boundary,
the external bottleneck dominates both CPU and GPU deployments./**
* CK-ENGINE KERNEL RULES:
* =======================
* 1. NO malloc/free - memory via bump allocator, pointers passed in
* 2. NO OpenMP - parallelization at orchestrator/codegen layer
* 3. API must define: inputs, outputs, workspace, and memory layouts
* 4. Pure computation - deterministic, no side effects
*/The CPU-first bet is therefore already visible on ARM. What remains is not to prove viability from scratch. It is to widen ARM kernel coverage beyond the three existing quantized NEON paths so the rest of the decode stack stops falling back to scalar. For ARM reviewers, “already shipped once” is a stronger starting point than any roadmap deck. The hottest quantized inner loop has already touched real Cortex-A72 silicon.
ARM NEON: The ISA Foundation
NEON is ARM64’s universal SIMD baseline: 128-bit vector registers, 32 architectural vector registers v0 through v31, and lane types that cover int8, int16, int32, and fp32. For CKE’s quantized kernels that means the exact operand widths needed for dequant-and-dot work: bytes for packed weights and activations, widened halfwords for intermediate products, then 32-bit accumulators.
The practical beauty of NEON is not that it is exotic. It is the opposite. Every serious ARM64 CPU target already has it, from small Cortex-A cores through Apple Silicon and up to server-class Neoverse. That makes it the natural first ARM tier for CKE, just as SSE2 was the unavoidable x86 baseline in earlier posts.
| ISA | Register width | Typical lanes | Register count | What matters for CKE |
|---|---|---|---|---|
| NEON | 128-bit | 16×int8 / 8×int16 / 4×fp32 | 32 vector registers | Universal ARM64 SIMD baseline for quantized GEMV. |
| SSE2 | 128-bit | 16×int8 / 4×fp32 | 16 XMM registers in x86-64 | Useful width but fewer architectural vector registers. |
| AVX | 256-bit | 32×int8 / 8×fp32 | 16 YMM registers | Wider than NEON, but still manual reduction-heavy. |
| AVX-512 | 512-bit | 64×int8 / 16×fp32 | 32 ZMM registers | Wider vectors plus cleaner reductions and richer integer instructions. |
NEON register file on AArch64:
v0 = [ lane0 lane1 lane2 lane3 ... lane15 ]
v1 = [ lane0 lane1 lane2 lane3 ... lane15 ]
...
v31 = [ lane0 lane1 lane2 lane3 ... lane15 ]
For the Q8_0 kernel in CKE:
int8x16_t = 16 signed bytes at once
int16x8_t = 8 widened halfwords at once
int32x4_t = 4 accumulated partial sums at onceLoad / duplicate:
vld1q_s8(ptr)
vdupq_n_s32(0)
Multiply and widen:
vmull_s8(a, b)
vmovl_s8(vget_low_s8(x))
vmull_s16(a, b)
Accumulate:
vaddq_s32(acc, x)
vmlal_s16(acc, a, b)
vpaddlq_s16(x)
Store / final reduction:
vst1q_s32(lanes, acc)From an ISA reviewer’s perspective, one of NEON’s underrated strengths is register count. ARM64 gives the software 32 vector registers even at 128-bit width, which is a better register story than classic SSE and one reason ARM code can stage unpacked and scaled data cleanly. 32 × 128-bit That is twice the vector register count of x86-64 SSE-era XMM state, even though both tiers are 128-bit wide.

For CKE, that universality matters more than peak width. Once a quantized kernel is written in NEON intrinsics, every AArch64 deployment tier immediately gets a concrete SIMD implementation instead of a scalar apology.
The Dispatch Pattern: How CKE Selects ARM NEON
CKE’s dispatch model is intentionally plain. There is no JIT, no opaque runtime codegen, and no mystery about which tier is active. The production pattern is compile-time #if / #elif selection. On x86 that means AVX-512, then AVX2, then AVX, then SSE. On ARM64 it means the compiler’s __aarch64__ and __ARM_NEON macros expose the NEON tier directly.
The key operational point is that NEON is in the actual dispatch chain today. For Q8_0 × Q8_0 and Q5_0 × Q8_0 it is not a dead stub. It is the active selected path on ARM builds. For Q6_K × Q8_K the NEON kernel is compiled and callable, but the public dispatch remains pinned to the scalar reference for llama.cpp-style reduction-order parity.
void vec_dot_q8_0_q8_0(int n, float *s, const void *vx, const void *vy)
{
const char *ref_env = getenv("CK_DEBUG_Q8_0_Q8_0_REF");
if (ref_env && ref_env[0] && ref_env[0] != '0') {
vec_dot_q8_0_q8_0_ref(n, s, vx, vy);
return;
}
#ifdef __AVX512F__
vec_dot_q8_0_q8_0_avx512(n, s, vx, vy);
#elif defined(__AVX2__)
vec_dot_q8_0_q8_0_avx2(n, s, vx, vy);
#elif defined(__ARM_NEON) || defined(__aarch64__)
vec_dot_q8_0_q8_0_neon(n, s, vx, vy);
#elif defined(__AVX__)
vec_dot_q8_0_q8_0_avx(n, s, vx, vy);
#elif defined(__SSE4_1__)
vec_dot_q8_0_q8_0_sse(n, s, vx, vy);
#else
vec_dot_q8_0_q8_0_ref(n, s, vx, vy);
#endif
}/* ARM */
#if defined(__aarch64__)
#if defined(__ARM_FEATURE_SVE2)
#define CK_HAS_SVE2 1
#endif
#if defined(__ARM_FEATURE_NEON)
#define CK_HAS_NEON 1
#endif
#endif
/* Best available vector width */
#if defined(CK_HAS_AMX)
#define CK_VECTOR_WIDTH 512
#define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX512)
#define CK_VECTOR_WIDTH 512
#define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX2_FMA)
#define CK_VECTOR_WIDTH 256
#define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX)
#define CK_VECTOR_WIDTH 256
#define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_NEON)
#define CK_VECTOR_WIDTH 128
#define CK_HAS_BEST_VECTOR 1
#else
#define CK_VECTOR_WIDTH 32 /* Scalar fallback */
#define CK_HAS_BEST_VECTOR 0
#endifAVX-512
↓
AVX2
↓
ARM NEON
↓
AVX
↓
SSE4.1 / SSSE3
↓
Scalar reference
On AArch64:
compiler defines __aarch64__
include <arm_neon.h>
select *_neon() implementation where presentThis is exactly the kind of dispatch code a silicon team wants to inspect: short, deterministic, and easy to force back to reference. There is nowhere for implementation truth to hide. Cross-compile for AArch64 and the quantized GEMV inner loop immediately stops being scalar wherever a NEON branch exists. That is the entire practical meaning of “ARM support” at kernel level.

Just as in the SIMD deep dive, the kernel rules matter here too: no malloc, no OpenMP inside the kernel, pure computation only. That constraint is what keeps ISA dispatch reviewable instead of burying hardware decisions inside runtime orchestration noise.
Q8_0 × Q8_0: The Cleanest NEON Kernel
Of the three ARM paths, vec_dot_q8_0_q8_0_neon() is the cleanest expression of the idea. There is no nibble unpacking, no high-bit merge, and no extra scale remapping. It is just a blockwise int8 dot product: load bytes, multiply pairs into int16, pairwise widen-accumulate into int32, then collapse the four accumulator lanes into one scalar.
This is the ARM version of the same systems message the SIMD deep dive made about x86: the algorithm is not changing. The ISA surface is changing. NEON and AVX2 are two different ways of spelling the same math.
#if defined(__ARM_NEON) || defined(__aarch64__)
void vec_dot_q8_0_q8_0_neon(int n, float *s, const void *vx, const void *vy)
{
const int qk = QK8_0;
const int nb = n / qk;
const block_q8_0 *x = (const block_q8_0 *)vx;
const block_q8_0 *y = (const block_q8_0 *)vy;
float sumf = 0.0f;
for (int ib = 0; ib < nb; ib++) {
const int8x16_t x0 = vld1q_s8(&x[ib].qs[0]);
const int8x16_t x1 = vld1q_s8(&x[ib].qs[16]);
const int8x16_t y0 = vld1q_s8(&y[ib].qs[0]);
const int8x16_t y1 = vld1q_s8(&y[ib].qs[16]);
int32x4_t acc = vdupq_n_s32(0);
const int16x8_t p0 = vmull_s8(vget_low_s8(x0), vget_low_s8(y0));
const int16x8_t p1 = vmull_s8(vget_high_s8(x0), vget_high_s8(y0));
const int16x8_t p2 = vmull_s8(vget_low_s8(x1), vget_low_s8(y1));
const int16x8_t p3 = vmull_s8(vget_high_s8(x1), vget_high_s8(y1));
acc = vaddq_s32(acc, vpaddlq_s16(p0));
acc = vaddq_s32(acc, vpaddlq_s16(p1));
acc = vaddq_s32(acc, vpaddlq_s16(p2));
acc = vaddq_s32(acc, vpaddlq_s16(p3));
int32_t lanes[4];
vst1q_s32(lanes, acc);
const int sumi = lanes[0] + lanes[1] + lanes[2] + lanes[3];
sumf += (float)sumi * (CK_FP16_TO_FP32(x[ib].d) * CK_FP16_TO_FP32(y[ib].d));
}
*s = sumf;
}
#endifLoad phase:
vld1q_s8(&x[ib].qs[0]) -> 16 activation bytes
vld1q_s8(&y[ib].qs[0]) -> 16 weight bytes
Multiply phase:
vmull_s8(low8, low8) -> 8 int16 products
vmull_s8(high8, high8) -> 8 int16 products
Pairwise accumulate:
vpaddlq_s16(p0) -> 4 int32 partial sums
vpaddlq_s16(p1) -> 4 int32 partial sums
Block reduction:
vaddq_s32(acc, partial)
vst1q_s32(lanes, acc)
lanes[0] + lanes[1] + lanes[2] + lanes[3]for (; ib < nb; ++ib) {
const __m256 d = _mm256_set1_ps(CK_FP16_TO_FP32(x[ib].d) * CK_FP16_TO_FP32(y[ib].d));
const __m256i qx = _mm256_loadu_si256((const __m256i *)x[ib].qs);
const __m256i qy = _mm256_loadu_si256((const __m256i *)y[ib].qs);
const __m256 q = mul_sum_i8_pairs_float_q8_0_avx2(qx, qy);
#if defined(__FMA__)
acc = _mm256_fmadd_ps(d, q, acc);
#else
acc = _mm256_add_ps(_mm256_mul_ps(d, q), acc);
#endif
}
sumf = hsum_float_8_q8_0(acc);| Intrinsic | What it does in the NEON path | Why it matters |
|---|---|---|
vld1q_s8 | Loads 16 signed int8 lanes. | Moves packed quantized values directly into vector state. |
vmull_s8 | Multiplies 8 int8 lanes and widens to int16. | Avoids scalar unpack loops. |
vpaddlq_s16 | Pairwise add and widen int16 to int32. | Collapses 8 products into 4 safer accumulator lanes. |
vaddq_s32 | Adds int32 vectors. | Maintains block accumulation in-register. |
vst1q_s32 | Stores the final four accumulator lanes. | Enables the manual horizontal sum CKE uses today. |
There is no hand-waving here. The kernel is an explicit int8×int8 dot product expressed in NEON intrinsics. If an ARM reviewer wants to understand CKE’s current ARM competence in one screenful, this is the screenful to start with. The x86 SIMD deep dive showed the same idea on Intel and AMD instruction sets. The ARM takeaway is not that NEON equals AVX2 in width. It is that CKE knows how to map the same quantized contract into ARM’s own widening-and-reduction idioms.

Q5_0 × Q8_0: Bit-Unpacking Plus NEON
Q5_0 × Q8_0 adds the complication that makes quantized CPU kernels feel like real systems work rather than textbook linear algebra: the weights are not stored as straightforward signed bytes. They are stored as 5-bit values split between low nibbles and a separate high-bit field. The NEON compute core is still elegant, but there is scalar unpack work before the vector multiply starts.
This is exactly why ISA reviewers should care about file-level kernel competence. Real quantized inference is not a single pretty dot instruction. It is layout decode plus arithmetic plus reduction plus parity policy.
#if defined(__ARM_NEON) || defined(__aarch64__)
static inline int32_t ck_hsum_s32x4(int32x4_t v)
{
int32_t lanes[4];
vst1q_s32(lanes, v);
return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
#endifuint32_t qh;
memcpy(&qh, x[ib].qh, sizeof(qh));
int8_t wvals[QK5_0];
for (int j = 0; j < qk / 2; j++) {
const uint8_t packed = x[ib].qs[j];
const uint8_t xh_0 = ((qh >> (j + 0)) & 1u) << 4;
const uint8_t xh_1 = ((qh >> (j + 16)) & 1u) << 4;
wvals[j] = (int8_t)(((packed & 0x0F) | xh_0) - 16);
wvals[j + qk / 2] = (int8_t)(((packed >> 4) | xh_1) - 16);
}void vec_dot_q5_0_q8_0_neon(int n, float *s, const void *vx, const void *vy)
{
const int qk = QK5_0;
const int nb = n / qk;
const block_q5_0 *x = (const block_q5_0 *)vx;
const block_q8_0 *y = (const block_q8_0 *)vy;
float sumf = 0.0f;
for (int ib = 0; ib < nb; ib++) {
uint32_t qh;
memcpy(&qh, x[ib].qh, sizeof(qh));
int8_t wvals[QK5_0];
for (int j = 0; j < qk / 2; j++) {
const uint8_t packed = x[ib].qs[j];
const uint8_t xh_0 = ((qh >> (j + 0)) & 1u) << 4;
const uint8_t xh_1 = ((qh >> (j + 16)) & 1u) << 4;
wvals[j] = (int8_t)(((packed & 0x0F) | xh_0) - 16);
wvals[j + qk / 2] = (int8_t)(((packed >> 4) | xh_1) - 16);
}
const int8x16_t w0 = vld1q_s8(&wvals[0]);
const int8x16_t w1 = vld1q_s8(&wvals[16]);
const int8x16_t x0 = vld1q_s8(&y[ib].qs[0]);
const int8x16_t x1 = vld1q_s8(&y[ib].qs[16]);
int32x4_t acc = vdupq_n_s32(0);
int16x8_t p0 = vmull_s8(vget_low_s8(w0), vget_low_s8(x0));
int16x8_t p1 = vmull_s8(vget_high_s8(w0), vget_high_s8(x0));
int16x8_t p2 = vmull_s8(vget_low_s8(w1), vget_low_s8(x1));
int16x8_t p3 = vmull_s8(vget_high_s8(w1), vget_high_s8(x1));
acc = vaddq_s32(acc, vpaddlq_s16(p0));
acc = vaddq_s32(acc, vpaddlq_s16(p1));
acc = vaddq_s32(acc, vpaddlq_s16(p2));
acc = vaddq_s32(acc, vpaddlq_s16(p3));
const float d = CK_FP16_TO_FP32(x[ib].d) * CK_FP16_TO_FP32(y[ib].d);
sumf += d * (float)ck_hsum_s32x4(acc);
}
*s = sumf;
}void vec_dot_q5_0_q8_0(int n, float *s, const void *vx, const void *vy)
{
#if defined(__AVX512F__)
vec_dot_q5_0_q8_0_avx512(n, s, vx, vy);
#elif defined(__AVX2__)
vec_dot_q5_0_q8_0_avx2(n, s, vx, vy);
#elif defined(__ARM_NEON) || defined(__aarch64__)
vec_dot_q5_0_q8_0_neon(n, s, vx, vy);
#elif defined(__AVX__)
vec_dot_q5_0_q8_0_avx(n, s, vx, vy);
#elif defined(__SSSE3__)
vec_dot_q5_0_q8_0_sse(n, s, vx, vy);
#else
vec_dot_q5_0_q8_0_ref(n, s, vx, vy);
#endif
}The key difference from Q8_0 is representational overhead. Q5_0 requires unpack work because 5-bit weights do not line up cleanly on byte boundaries, so the kernel first reconstructs signed int8 values and then hands the multiply to NEON. 5-bit weights This is why quantized kernel engineering is half data-layout archaeology and half SIMD arithmetic.

The compute sequence after unpack is familiar: vld1q_s8 → vmull_s8 → vpaddlq_s16 → vaddq_s32. That repetition is valuable. It means once the format-specific decode is done, the arithmetic surface returns to a stable ARM pattern.
Q6_K × Q8_K: The Most Complex NEON Kernel
Q6_K × Q8_K is the most important ARM kernel in this post because it is closer to large-model decode reality. The Q6_K format packs 256 weights per block, splits each 6-bit value across low and high fields, and applies per-sub-block scales. In other words, it looks much more like the actual quantized inference work a CPU stack must do for modern LLMs.
The NEON path here is more elaborate than the clean Q8_0 kernel because it performs a three-stage pipeline: reconstruct the 6-bit weights, align sub-block scales, and then do the widened multiply-accumulate sequence against the Q8 activations.
/**
* Q6_K Format (256 weights per block):
* - d: FP16 super-block scale
* - ql: 128 bytes (low 4 bits of each weight)
* - qh: 64 bytes (high 2 bits of each weight)
* - scales: 16 int8 sub-block scales
*
* Q8_K Format (256 weights per block):
* - d: FP32 scale
* - qs: 256 int8 values
* - bsums: 16 int16 block sums
*/#if defined(__ARM_NEON) || defined(__aarch64__)
static float dot_q6_k_q8_k_neon(const block_q6_K *w,
const block_q8_K *x,
int K)
{
const int nb = K / QK_K;
float sumf = 0.0f;
for (int i = 0; i < nb; ++i) {
const float d = GGML_FP16_TO_FP32(w[i].d) * x[i].d;
const uint8_t *ql = w[i].ql;
const uint8_t *qh = w[i].qh;
const int8_t *sc = w[i].scales;
const int8_t *q8 = x[i].qs;
int8_t wvals[QK_K];
int8_t svals[QK_K];
for (int n = 0; n < QK_K; n += 128) {
for (int l = 0; l < 32; ++l) {
const int is = l / 16;
const int8_t q1 = (int8_t)((ql[l + 0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
const int8_t q2 = (int8_t)((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
const int8_t q3 = (int8_t)((ql[l + 0] >> 4) | (((qh[l] >> 4) & 3) << 4)) - 32;
const int8_t q4 = (int8_t)((ql[l + 32] >> 4) | (((qh[l] >> 6) & 3) << 4)) - 32;
const int base = n;
wvals[base + l + 0] = q1;
wvals[base + l + 32] = q2;
wvals[base + l + 64] = q3;
wvals[base + l + 96] = q4;
svals[base + l + 0] = sc[is + 0];
svals[base + l + 32] = sc[is + 2];
svals[base + l + 64] = sc[is + 4];
svals[base + l + 96] = sc[is + 6];
}
ql += 64;
qh += 32;
sc += 8;
}
int32x4_t acc = vdupq_n_s32(0);
for (int j = 0; j < QK_K; j += 16) {
const int8x16_t wv = vld1q_s8(&wvals[j]);
const int8x16_t sv = vld1q_s8(&svals[j]);
const int8x16_t xv = vld1q_s8(&q8[j]);
const int16x8_t ws0 = vmull_s8(vget_low_s8(wv), vget_low_s8(sv));
const int16x8_t ws1 = vmull_s8(vget_high_s8(wv), vget_high_s8(sv));
const int16x8_t x0 = vmovl_s8(vget_low_s8(xv));
const int16x8_t x1 = vmovl_s8(vget_high_s8(xv));
int32x4_t p0 = vmull_s16(vget_low_s16(ws0), vget_low_s16(x0));
p0 = vmlal_s16(p0, vget_high_s16(ws0), vget_high_s16(x0));
int32x4_t p1 = vmull_s16(vget_low_s16(ws1), vget_low_s16(x1));
p1 = vmlal_s16(p1, vget_high_s16(ws1), vget_high_s16(x1));
acc = vaddq_s32(acc, p0);
acc = vaddq_s32(acc, p1);
}
int32_t lanes[4];
vst1q_s32(lanes, acc);
sumf += d * (float)(lanes[0] + lanes[1] + lanes[2] + lanes[3]);
}
return sumf;
}
#endifvoid vec_dot_q6_k_q8_k(int n, float *s, const void *vx, const void *vy)
{
if (!s || !vx || !vy || n <= 0) {
return;
}
const block_q6_K *x = (const block_q6_K *)vx;
const block_q8_K *y = (const block_q8_K *)vy;
/* Keep the public dispatch on the llama-compatible reduction order.
* The SIMD variants are still exported for direct tests/benchmarks, but
* their horizontal reduction order can move borderline logits in Qwen3.5
* long-decode parity. */
*s = dot_q6_k_q8_k_ref(x, y, n);
}Q8_0 path:
int8 load
vmull_s8
vpaddlq_s16
vaddq_s32
store lanes + scalar sum
Q6_K path:
unpack ql + qh into 6-bit signed bytes
map sub-block scales into svals[]
vmull_s8(weight, scale)
vmovl_s8(activation)
vmull_s16 / vmlal_s16
vaddq_s32
store lanes + scalar sumThis is the kernel that makes the TDA4VM story matter. It is not just a toy byte-dot loop. It is a production-style packed-format decode kernel with explicit NEON arithmetic in the hottest part of inference. The public entry point stays on the scalar reduction order for parity, but that does not erase the engineering signal. The NEON kernel exists, compiles, and is available for direct tests and benchmarks.

The important design choice is that the algorithm stays visible. Nothing about this kernel is opaque to hardware review. Every unpack, widen, multiply, and reduction stage is spelled in ordinary C plus NEON intrinsics.
The Horizontal Reduction Problem
All three ARM kernels in this post eventually hit the same awkward point: SIMD vectors are great at lane-local work, but the kernel finally needs one scalar answer. Today CKE handles that with the most portable approach possible on NEON: store the accumulator lanes to memory and add them in scalar code.
This is not a design flaw. It is a revealing ISA detail. In the x86 SIMD discussion, AVX-512 made reductions much cleaner than AVX1. ARM has a similar story: baseline NEON often uses manual lane collapse, while later ARMv8.1+ cores expose helpers like vaddvq_s32 that make the final reduction tighter.
int32_t lanes[4];
vst1q_s32(lanes, acc);
const int sumi = lanes[0] + lanes[1] + lanes[2] + lanes[3];#if defined(__AVX__) && !defined(__AVX512F__)
static inline float hsum256_ps_rmsnorm(__m256 v) {
__m128 hi = _mm256_extractf128_ps(v, 1);
__m128 lo = _mm256_castps256_ps128(v);
__m128 sum128 = _mm_add_ps(lo, hi);
sum128 = _mm_hadd_ps(sum128, sum128);
sum128 = _mm_hadd_ps(sum128, sum128);
return _mm_cvtss_f32(sum128);
}
#endif__m512 sum_sq_vec = _mm512_setzero_ps();
for (; d + 16 <= D; d += 16) {
__m512 xv = _mm512_loadu_ps(&x[d]);
sum_sq_vec = _mm512_fmadd_ps(xv, xv, sum_sq_vec);
}
float sum_sq = _mm512_reduce_add_ps(sum_sq_vec);#if defined(__ARM_FEATURE_QRDMX) || defined(__aarch64__)
static inline int32_t ck_hsum_s32x4_fast(int32x4_t v)
{
return vaddvq_s32(v);
}
#endif| Tier | Reduction method | Consequence |
|---|---|---|
| Current CKE NEON | vst1q_s32 then scalar lane add | Portable and correct, but a little clunky. |
| AVX1 / AVX2 | Custom helper with extracts and horizontal adds | Similar manual choreography. |
| AVX-512 | _mm512_reduce_add_ps | Reduction is explicit in the ISA surface. |
| ARMv8.1+ option | vaddvq_s32 | Cleaner NEON reduction without changing kernel math. |
Reduction is one of the most telling places to compare ISAs because it exposes whether the hardware sees ML-style accumulations as first-class patterns or as something software still has to stitch together by hand. 1 instruction That is the practical appeal of AVX-512 reduce-add and modern ARM horizontal-reduction helpers: less choreography in the hottest collapse step.
What Is Not NEON-Optimized Yet
The honest ARM story is not “CKE is fully optimized on ARM.” The honest story is much more interesting: ARM already has three real quantized kernels, but roughly eighty other kernel files still fall back to scalar on ARM even when x86 has SIMD specializations. That is exactly the kind of partially-complete but clearly-real state that tells a silicon team where joint work could matter.
The first order bit is coverage. The second is priority. Not every kernel deserves equal attention. Decode-critical normalization, attention, activation, and RoPE kernels should be ported before long-tail utilities.
| Inventory slice | Count | Meaning |
|---|---|---|
Total kernel .c files | 83 | CKE is already a broad kernel repository, not a single-demo codebase. |
| Files with x86 SIMD intrinsics | 49 | x86 path is already materially ISA-specialized. |
| Files with ARM NEON intrinsics | 3 | ARM coverage exists today, but only for quantized GEMV/dot. |
| Files still scalar on ARM | ≈80 | Main opportunity for NEON and SVE2 expansion. |
| Priority | Kernel family | Why ARM should care |
|---|---|---|
| 1 | rmsnorm_kernels.c | Every layer touches it; current ARM fallback is scalar. |
| 2 | attention_kernels.c | 258 x86 intrinsics already exist; decode latency depends on it. |
| 3 | swiglu_kernels.c / gelu_kernels.c | MLP activation bandwidth and FMA density matter per token. |
| 4 | rope_kernels.c | Position embedding math is regular and mechanically portable. |
| 5 | softmax_kernels.c | Reduction-heavy and numerically sensitive, but still formulaically straightforward. |
void rmsnorm_forward_neon(const float *x,
const float *gamma,
float *y,
int D,
float eps)
{
float32x4_t sum_sq_v = vdupq_n_f32(0.0f);
int d = 0;
for (; d + 4 <= D; d += 4) {
float32x4_t xv = vld1q_f32(&x[d]);
sum_sq_v = vfmaq_f32(sum_sq_v, xv, xv);
}
float sum_sq = vaddvq_f32(sum_sq_v);
for (; d < D; ++d) {
sum_sq += x[d] * x[d];
}
const float rstd = 1.0f / sqrtf(sum_sq / (float)D + eps);
float32x4_t rstd_v = vdupq_n_f32(rstd);
d = 0;
for (; d + 4 <= D; d += 4) {
float32x4_t xv = vld1q_f32(&x[d]);
float32x4_t gv = vld1q_f32(&gamma[d]);
float32x4_t yv = vmulq_f32(vmulq_f32(xv, rstd_v), gv);
vst1q_f32(&y[d], yv);
}
for (; d < D; ++d) {
y[d] = x[d] * rstd * gamma[d];
}
}for (int h = 0; h < num_heads; ++h) {
float32x4_t acc0 = vdupq_n_f32(0.0f);
for (int d = 0; d + 4 <= head_dim; d += 4) {
float32x4_t qv = vld1q_f32(&q[h * head_dim + d]);
float32x4_t kv = vld1q_f32(&k_cache[token * head_dim + d]);
acc0 = vfmaq_f32(acc0, qv, kv);
}
scores[h] = vaddvq_f32(acc0) * inv_sqrt_d;
}// RoPE sketch
float32x4_t xv = vld1q_f32(x + d);
float32x4_t cv = vld1q_f32(cos_t + d);
float32x4_t sv = vld1q_f32(sin_t + d);
float32x4_t y0 = vfmsq_f32(vmulq_f32(xv, cv), rot, sv);
// SwiGLU sketch
float32x4_t a = vld1q_f32(gate + d);
float32x4_t b = vld1q_f32(up + d);
float32x4_t sig = sigmoid4_neon(a);
float32x4_t out = vmulq_f32(vmulq_f32(a, sig), b);The ARM opportunity is therefore not “invent new math.” It is port existing correct math into ARM intrinsics, starting with the kernels that sit in every decode step. That is why the remaining work is best described as mechanical rather than algorithmic. Scalar correctness already exists. The task is to change the execution surface.
SVE2: What ARM AGI CPU Unlocks
If NEON is the first ARM proof point, SVE2 is the architecture-scale destination. The jump is not just wider vectors. It is scalable vector length, predication, and native support for patterns that look much more like modern ML loops. That matters because the target hardware under discussion is not a phone core. It is ARM’s AGI CPU class, built around Neoverse V3: 136 cores, SVE2, native bfloat16, 12 DDR5 channels at 8800 MT/s, and CXL 3.0 for memory expansion.
For CKE, SVE2 means one source-level vector algorithm can become vector-length agnostic. The same binary can run on 128-bit SVE implementations and on much wider ones without rewriting the cleanup loop for each width. That is a big deal for normalization, activation, and decode kernels whose bodies are structurally regular.
| ARM AGI CPU trait | Why it matters to CKE |
|---|---|
| 136 Neoverse V3 cores | Token decode is embarrassingly parallel across sessions even when a single stream is bandwidth-bound. |
| SVE2 | Predicated vector loops remove a lot of manual tail handling. |
| Native bfloat16 | Opens a serious path for bf16 kernels beyond fp32 scalar fallback. |
| 12× DDR5 8800 MT/s | Exactly the resource memory-bound GEMV wants. |
| CXL 3.0 | Multi-terabyte memory scaling aligns with CKE’s CPU-first thesis. |
#if defined(__aarch64__)
#if defined(__ARM_FEATURE_SVE2)
#define CK_HAS_SVE2 1
#endif
#if defined(__ARM_FEATURE_NEON)
#define CK_HAS_NEON 1
#endif
#endif#include <arm_sve.h>
void rmsnorm_forward_sve2(const float *x,
const float *gamma,
float *y,
int D,
float eps)
{
svfloat32_t sum_sq = svdup_f32(0.0f);
int d = 0;
while (d < D) {
svbool_t pg = svwhilelt_b32(d, D);
svfloat32_t xv = svld1_f32(pg, &x[d]);
sum_sq = svfmla_f32_z(pg, sum_sq, xv, xv);
d += svcntw();
}
float total = svaddv_f32(svptrue_b32(), sum_sq);
const float rstd = 1.0f / sqrtf(total / (float)D + eps);
svfloat32_t rstd_v = svdup_f32(rstd);
d = 0;
while (d < D) {
svbool_t pg = svwhilelt_b32(d, D);
svfloat32_t xv = svld1_f32(pg, &x[d]);
svfloat32_t gv = svld1_f32(pg, &gamma[d]);
svfloat32_t yv = svmul_f32_z(pg, svmul_f32_z(pg, xv, rstd_v), gv);
svst1_f32(pg, &y[d], yv);
d += svcntw();
}
}Predication primitives:
svptrue_b32() -> all lanes active
svwhilelt_b32(i, D) -> tail predicate, no scalar cleanup loop
Predicated memory:
svld1_f32(pg, ptr)
svst1_f32(pg, ptr)
Predicated arithmetic:
svfmla_f32_z(pg, acc, a, b)
svmul_f32_z(pg, a, b)
Bfloat16 opportunity:
svbfmmla(...) -> bf16 matrix multiply-accumulate on supporting coressvbool_t pg = svptrue_b16();
svbfloat16_t av = svld1_bf16(pg, a_ptr);
svbfloat16_t bv = svld1_bf16(pg, b_ptr);
svfloat32_t acc = svdup_f32(0.0f);
acc = svbfmmla_f32(acc, av, bv);The ARM AGI CPU target is compelling precisely because it aligns ISA evolution with memory-system scale. SVE2 gives cleaner vector semantics; Neoverse V3 gives the core count and DDR bandwidth that decode workloads actually consume. 136 cores in the target Neoverse V3 class. That is enough parallel host capacity to turn CKE from “it runs” into “it scales.”

In other words, NEON proves CKE already speaks ARM. SVE2 is what would let it speak ARM at flagship server scale.
The TDA4VM Story: Where This Already Works
The TI TDA4VM matters because it strips the conversation of excuses. This is not a giant 500-watt server CPU. It is a dual-core Cortex-A72 SoC from a prior generation, with NEON and a constrained memory budget. Yet Qwen2 and Qwen3 were run on it through CKE’s quantized path.
That means the whole pipeline closed: model path selection, quantized weights, generated C kernels, ARM compile, and actual inference. Most importantly, it means the quantized GEMV inner loop was not a scalar-only placeholder. The NEON kernels discussed above were in the relevant thermal and latency path.
| Platform | CPU profile | What it proves |
|---|---|---|
| TI TDA4VM | 2× Cortex-A72, ARMv8.0, NEON, limited RAM | CKE already runs real quantized Qwen inference on ARM. |
| ARM AGI CPU / Neoverse V3 | 136 cores, SVE2, bf16, 12-channel DDR5, CXL | Natural next target for scaled CPU-only deployment. |
model definition
-> template / IR selection
-> kernel code generation
-> C source with quantized kernels
-> AArch64 compile
-> deploy on TI TDA4VM
-> run Qwen2 / Qwen3 inference# Current proof tier
clang -O3 -march=armv8-a+simd -ffast-math -c gemm_kernels_q8_0.c
# Neoverse-class NEON tier
clang -O3 -mcpu=neoverse-v2 -c gemm_kernels_q5_0.c
# Future SVE2 tier
clang -O3 -march=armv9-a+sve2+bf16 -c rmsnorm_kernels.cA dual-core 2017-era ARM SoC already ran real Qwen models. That is the fact ARM should keep in mind before anyone says CPU inference on ARM is speculative. If the proof point already exists on Cortex-A72, the real question is what 136 Neoverse V3 cores plus SVE2 plus multi-terabyte DDR5 unlock.
Compile-Time vs Runtime Dispatch
CKE today is mostly a compile-time dispatch codebase. That means separate binaries can be produced for different ISA tiers without paying runtime branch cost inside the hot kernel entry. For silicon evaluation this is often ideal: build one binary for NEON, another for SVE2, and compare them cleanly.
There is, of course, a runtime-dispatch alternative. ARM Linux exposes hardware capability bits through getauxval(AT_HWCAP) and AT_HWCAP2. A single binary can inspect those flags and choose between scalar, NEON, SVE, and SVE2 paths. That is more flexible for broad distribution, but it complicates benchmarking and leaves an extra branch surface in the dispatch layer.
#if defined(__ARM_FEATURE_SVE2)
attention_decode_sve2(...);
#elif defined(__ARM_NEON) || defined(__aarch64__)
attention_decode_neon(...);
#else
attention_decode_ref(...);
#endif#include <sys/auxv.h>
#include <asm/hwcap.h>
void attention_decode_auto(...)
{
unsigned long hwcap = getauxval(AT_HWCAP);
unsigned long hwcap2 = getauxval(AT_HWCAP2);
if (hwcap2 & HWCAP2_SVE2) {
attention_decode_sve2(...);
} else if (hwcap & HWCAP_ASIMD) {
attention_decode_neon(...);
} else {
attention_decode_ref(...);
}
}Target A: armv8-a + NEON
- deterministic codegen
- zero runtime ISA branch cost
- easy perf counter attribution
Target B: armv9-a + SVE2 + bf16
- deterministic codegen
- direct apples-to-apples ISA comparison
- simpler regression tracking across compiler versions| Approach | Strength | Weakness |
|---|---|---|
| Compile-time dispatch | Deterministic and zero-overhead in the hot path. | Requires multiple binaries for multiple ISA tiers. |
| Runtime dispatch | One binary can span more machines. | Adds branch surface and benchmarking complexity. |
For a silicon team, compile-time dispatch is usually a feature, not a limitation. It makes the mapping between compiler flags, intrinsics, and perf counters explicit. That is especially true when evaluating a new ISA tier like SVE2: one optimized binary per tier is exactly what the bring-up workflow wants.
Memory Bandwidth: Why ARM Server CPUs Matter for GEMV
Once the math is vectorized, decode becomes a bandwidth story. Batch-1 LLM inference is dominated by repeatedly streaming weights through GEMV-like kernels. That puts the workload on the flat, memory-bound side of the roofline model. Wider SIMD helps, but only until memory traffic becomes the real limiter.
This is why the AGI CPU specification matters so much. Twelve DDR5 channels at 8800 MT/s and CXL memory expansion are not generic datacenter bragging points. They are exactly the system resources a CPU-first quantized inference engine can actually monetize.
| Representation | Approximate bytes for a 7B model | Bandwidth consequence |
|---|---|---|
| FP16 | ≈14 GB | Every decode step pulls a very large working set. |
| Q8_0 | ≈7 GB plus metadata | Half the traffic of FP16, same decode semantics. |
| Q4_K | ≈3.5–4 GB plus metadata | Further cuts bytes moved per token. |
Assume:
7B parameters
Q8_0 storage ~= 1 byte per weight
Weight traffic per token ~= 7 GB
If memory bandwidth ~= 500 GB/s:
7 GB / 500 GB/s = 0.014 s
=> about 14 ms just to stream the weights once
That is why GEMV decode is memory-bound.
The compute can be fast and still wait on bytes.FP16 decode load : ~14 GB
Q8_0 decode load : ~7 GB
Q4_K decode load : ~3.5-4 GB
If the bottleneck is memory traffic,
then smaller quantized weights are not just cheaper storage.
They are direct latency leverage.
This is where the scaling page thesis meets silicon reality. If the problem is moving model weights, then an ARM server CPU with many DDR channels and huge memory capacity starts to look like the natural home for large-model single-stream inference. 12 DDR5 channels At roughly 500+ GB/s of aggregate bandwidth, the ARM server memory system lines up with what quantized GEMV actually needs.
Seen through the Theory of Constraints, this is exactly why the CPU-first argument is not anti-accelerator ideology. It is systems arithmetic. Once the model is memory-dominant and deployment is network-bounded, high-bandwidth, high-capacity CPU servers become practical inference machines.
From NEON to Production
The path from today’s three ARM kernels to a broadly ARM-optimized CKE is straightforward. It is real engineering work, but it is not conceptual reinvention. The scalar references already define the math, and x86 already defines the optimization intent. ARM needs the intrinsics layer filled in.
A sensible progression is to land fp32 normalization first, then decode-critical attention, then activation and position kernels, then add a clean SVE2 tier, then exploit bfloat16 on hardware that supports it.
#if defined(__ARM_NEON) || defined(__aarch64__)
rmsnorm_forward_neon(input, gamma, output, D, eps);
layernorm_forward_neon(input, gamma, beta, output, D, eps);
#else
rmsnorm_forward_ref(input, gamma, output, D, eps);
#endiffor (int t = 0; t < seq_len; ++t) {
float32x4_t score_v = vdupq_n_f32(0.0f);
for (int d = 0; d + 4 <= head_dim; d += 4) {
float32x4_t qv = vld1q_f32(q + d);
float32x4_t kv = vld1q_f32(k_cache + t * head_dim + d);
score_v = vfmaq_f32(score_v, qv, kv);
}
scores[t] = vaddvq_f32(score_v);
}#if defined(__ARM_FEATURE_SVE2)
vec_dot_q8_0_q8_0_sve2(n, s, vx, vy);
#elif defined(__ARM_NEON) || defined(__aarch64__)
vec_dot_q8_0_q8_0_neon(n, s, vx, vy);
#else
vec_dot_q8_0_q8_0_ref(n, s, vx, vy);
#endif#if defined(__ARM_FEATURE_SVE2) && defined(__ARM_FEATURE_BF16)
gemm_bf16_sve2(y, W, x, M, N, K);
#else
gemm_fp32_ref(y, W, x, M, N, K);
#endifThe right way to describe the remaining work is “porting and profiling,” not “discovering whether CPU inference is possible.” That discovery phase is already over. CKE’s scalar references and parity discipline mean each new ARM tier can be validated back to a known-good baseline with target error envelopes such as <1e-5 where appropriate.
The last step is empirical: run the new tiers on real hardware. Graviton4, Neoverse-based systems, or a future AGI CPU platform are where compiler codegen, cache behavior, and bandwidth realities become measurable instead of speculative.
The Bigger Picture: CPU-First Across ISAs
The ARM story sits inside a larger architectural pattern. CKE already treats ISA specialization as a first-class design axis: SSE2, AVX, AVX2+FMA, AVX-512, and now NEON are all visible tiers in the codebase. Future tiers such as SVE2, AMX, and RISC-V Vector fit naturally into the same contract.
That contract is why the project is interesting to silicon teams. The question is not merely whether a demo runs. The question is whether the repository makes the mapping from model math to instruction set explicit, correct, and reviewable. CKE increasingly does.
For readers evaluating the larger bet, the companion CKE scaling note is the broader systems argument behind this post. This NEON article is one concrete slice of that thesis: target instruction sets directly, measure the bottleneck, move the hot kernel, and keep the model inside a memory system that can actually serve it.
| ISA tier | Current CKE status | Strategic meaning |
|---|---|---|
| SSE2 / SSSE3 | Present in older x86 fallback kernels | Baseline x86 portability tier. |
| AVX | Present in many fp32 and quantized kernels | First 256-bit x86 widening tier. |
| AVX2+FMA | Heavy production use | Current mainstream x86 CPU AI tier. |
| AVX-512 / VNNI / AMX | Selective advanced kernels | Explicit mapping to newer Intel server/client parts. |
| NEON | 3 quantized kernels today | Real ARM proof point already shipping. |
| SVE2 | Feature-detected, not yet broadly implemented | Natural ARM server expansion tier. |
| RVV | Mentioned in feature detection and roadmap | Future CPU-first portability story beyond ARM/x86. |
/**
* CK-ENGINE KERNEL RULES:
* =======================
* 1. NO malloc/free - memory via bump allocator, pointers passed in
* 2. NO OpenMP - parallelization at orchestrator/codegen layer
* 3. API must define: inputs, outputs, workspace, and memory layouts
* 4. Pure computation - deterministic, no side effects
*/template
-> IR / kernel contract
-> code generation
-> specialized C kernel families
-> ISA-specific intrinsics where profitable
-> scalar fallback where parity must dominateCurrent: SSE2 / AVX / AVX2+FMA / AVX-512 / NEON
Next: SVE2 / AMX / RISC-V Vector
Constant: same algorithm, same kernel contract, different execution widthRepository shape is itself an engineering signal. Eighty-three kernel source files and forty-nine x86-SIMD files tell a silicon reviewer that CKE is treating kernels as a real substrate, not as a thin wrapper around generic libraries. 83 files of kernel C source today. The ARM opportunity is to raise the NEON/SVE2 share of that inventory.
That is also why the x86 SIMD deep dive matters as a companion. It established that CKE already reasons correctly about width, reduction ergonomics, and fused instructions. This ARM NEON walkthrough shows that the same engineering habit already exists on ARM, even if ARM coverage is earlier in its build-out.
Conclusion: An Open Invitation
CKE runs on ARM today. Not as a promise, not as a future branch, and not as a marketing abstraction. It ran Qwen2 and Qwen3 on real Cortex-A72 hardware, and the quantized inner loops discussed above contain real NEON intrinsics.
That is the invitation to ARM’s silicon team. The ISA mapping is already explicit. The algorithmic substrate is already correct. More than eighty kernels are still open for NEON and SVE2 ports, but that is exactly the kind of mechanical, high-leverage work that turns an existence proof into a flagship platform story.
ARM AGI CPU with 136 Neoverse V3 cores, SVE2, bfloat16, 12-channel DDR5, and CXL-backed multi-terabyte memory is not a stretch goal for CKE. It is the natural next environment in which to prove the full CPU-first thesis at server scale. CKE GitHub: https://github.com/antshiv/C-Kernel-Engine · Docs: https://c-kernel-engine.github.io/C-Kernel-Engine/ · Scaling thesis: https://c-kernel-engine.github.io/C-Kernel-Engine/scaling.html · YouTube: youtube.com/@antshivrobotics
Already true today:
- Qwen2 and Qwen3 ran on TI TDA4VM
- CKE ships NEON kernels for Q8_0×Q8_0, Q5_0×Q8_0, Q6_K×Q8_K
- NEON is active in public dispatch for Q8_0 and Q5_0
- Q6_K NEON path is compiled and benchmarkable
Natural next step:
- Port normalization, attention, RoPE, softmax, and activations
- Add SVE2 tier
- Add bf16 kernels
- Validate on Neoverse-class silicon