ARM NEON in C-Kernel-Engine: Real Quantized Kernels on ARM

CKE has already run Qwen2 and Qwen3 on ARM Cortex-A72 hardware (TI TDA4VM). This is not a roadmap post. This is a “what we already ship” post. For ARM silicon teams, the important fact is simple: C-Kernel-Engine already contains production NEON kernel paths for quantized inference, and the companion architectural walkthrough is on youtube.com/@antshivrobotics.

The message to ARM is the same message the SIMD deep dive made for x86 vendors: CKE does not speak about ISA support in abstractions. It names the exact intrinsics, the exact dispatch tier, the exact parity constraint, and the exact kernels that are already hot on real hardware. CPU-first inference is the through-line. The scaling thesis is still 0×∞=0: if the model does not fit in accelerator memory, theoretical accelerator throughput collapses into a systems problem. At the Ethernet boundary, CPUs and GPUs hit the same external bottleneck.

What this post covers

The first part establishes the ARM story: real deployment on TI TDA4VM, what NEON is, and how CKE dispatches into ARM code paths today. The kernel walkthrough then covers the three existing NEON quantized kernels and the reduction patterns they use.

The second part widens the frame: what is still scalar on ARM, what SVE2 on Neoverse V3 unlocks, why bandwidth dominates decode, how compile-time dispatch fits silicon evaluation, and why CKE is an open invitation for ARM AGI CPU bring-up.

CKE Already Runs on ARM

The most important sentence in this post is also the least hypothetical one: CKE has already run Qwen2 and Qwen3 on a TI TDA4VM. That SoC is a Jacinto 7 part with dual Cortex-A72 CPU cores. In other words, the ARM story here is not “NEON would be nice someday.” It is “NEON quantized inference already happened on shipping ARMv8 silicon.”

That matters because the series is written for CPU vendors, not app developers. A developer only needs to know that a model launches. A silicon team wants to know whether the hottest loops are explicit, auditable, and ISA-mapped. In CKE, the answer is yes for the quantized GEMV path that dominates batch-1 decode.

The scaling thesis from the CKE scaling page is the frame around all of this. If the model does not fit in GPU memory, the accelerator conversation instantly becomes a multi-GPU rack conversation. That is where 0×∞=0 stops being a slogan and becomes capital expenditure. A large-memory CPU server can stay in one coherent address space, and the Theory of Constraints says the Ethernet boundary equalizes more of the system than accelerator marketing usually admits.

TDA4VM is modest by datacenter standards, which is exactly why it matters. If quantized Qwen inference works on two Cortex-A72 cores with limited memory, then the “CPU-only is impossible” argument is already broken at the small end before Neoverse-class servers even enter the room. The proof point is not peak throughput. It is that the code generation, quantization, compile, and inference stack already closes on ARM silicon.

NEON vs x86 SIMD register comparison showing register widths, counts, and lane configurations across ISA tiers.

Scaling thesis carried into the ARM discussion

CPU-only inference thesis:

0 × ∞ = 0

If the model does not fit in accelerator memory,
you do not have an accelerator throughput problem.
You have a system-capacity problem.

Typical implication:
  - Multi-GPU memory solution: expensive, networked, complex
  - Large-memory CPU server: one address space, cheaper RAM scaling

Theory of Constraints:
  Once requests cross the Ethernet boundary,
  the external bottleneck dominates both CPU and GPU deployments.

Kernel contract carried by quantized inference files

/**
 * CK-ENGINE KERNEL RULES:
 * =======================
 * 1. NO malloc/free - memory via bump allocator, pointers passed in
 * 2. NO OpenMP - parallelization at orchestrator/codegen layer
 * 3. API must define: inputs, outputs, workspace, and memory layouts
 * 4. Pure computation - deterministic, no side effects
 */
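
Rule 1 is easier to picture with a concrete allocator. The following is a minimal sketch of a bump allocator that hands out workspace from a caller-owned buffer. The names ck_arena, ck_arena_init, and ck_arena_alloc are hypothetical and are not taken from the CKE source; alignment is computed relative to the start of the buffer, so the buffer itself is assumed to be suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: illustrative only, not the CKE allocator. */
typedef struct {
    uint8_t *base;   /* caller-owned backing buffer */
    size_t   size;   /* total bytes available */
    size_t   used;   /* bytes handed out so far */
} ck_arena;

static void ck_arena_init(ck_arena *a, void *buf, size_t size)
{
    a->base = (uint8_t *)buf;
    a->size = size;
    a->used = 0;
}

/* Returns a pointer inside the arena, aligned relative to the buffer start,
 * or NULL if the request does not fit. Nothing is freed individually;
 * the whole arena is reset by calling ck_arena_init again. */
static void *ck_arena_alloc(ck_arena *a, size_t bytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */
    const size_t start = (a->used + (align - 1)) & ~(align - 1);
    if (start > a->size || bytes > a->size - start) {
        return NULL;
    }
    a->used = start + bytes;
    return a->base + start;
}
```

A kernel written against this contract receives the returned pointer as its workspace argument and never calls malloc itself, which is what keeps the hot loop free of allocator side effects.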

The CPU-first bet is therefore already visible on ARM. What remains is not to prove viability from scratch. It is to widen ARM kernel coverage beyond the three existing quantized NEON paths so the rest of the decode stack stops falling back to scalar. For ARM reviewers, “already shipped once” is a stronger starting point than any roadmap deck. The hottest quantized inner loop has already touched real Cortex-A72 silicon.

ARM NEON: The ISA Foundation

NEON is ARM64’s universal SIMD baseline: 128-bit vector registers, 32 architectural vector registers v0 through v31, and lane types that cover int8, int16, int32, and fp32. For CKE’s quantized kernels that means the exact operand widths needed for dequant-and-dot work: bytes for packed weights and activations, widened halfwords for intermediate products, then 32-bit accumulators.

The practical beauty of NEON is not that it is exotic. It is the opposite. Every serious ARM64 CPU target already has it, from small Cortex-A cores through Apple Silicon and up to server-class Neoverse. That makes it the natural first ARM tier for CKE, just as SSE2 was the unavoidable x86 baseline in earlier posts.

ISA	Register width	Typical lanes	Register count	What matters for CKE
NEON	128-bit	16×int8 / 8×int16 / 4×fp32	32 vector registers	Universal ARM64 SIMD baseline for quantized GEMV.
SSE2	128-bit	16×int8 / 4×fp32	16 XMM registers in x86-64	Useful width but fewer architectural vector registers.
AVX / AVX2	256-bit	32×int8 (AVX2 only) / 8×fp32	16 YMM registers	Wider than NEON, but still manual reduction-heavy.
AVX-512	512-bit	64×int8 / 16×fp32	32 ZMM registers	Wider vectors plus cleaner reductions and richer integer instructions.

NEON register-width intuition

NEON register file on AArch64:

v0  = [ lane0 lane1 lane2 lane3 ... lane15 ]
v1  = [ lane0 lane1 lane2 lane3 ... lane15 ]
...
v31 = [ lane0 lane1 lane2 lane3 ... lane15 ]

For the Q8_0 kernel in CKE:
  int8x16_t  = 16 signed bytes at once
  int16x8_t  =  8 widened halfwords at once
  int32x4_t  =  4 accumulated partial sums at once
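
The byte, halfword, word progression is forced by the arithmetic. A scalar model of one lane makes the bounds visible. The function names below are illustrative and are not CKE APIs; the model only mirrors what vmull_s8 and vpaddlq_s16 do per lane.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one vmull_s8 lane: int8 x int8 -> int16.
 * The largest magnitude is (-128) * (-128) = 16384, which fits in int16. */
static int16_t widen_mul_s8(int8_t a, int8_t b)
{
    return (int16_t)((int16_t)a * (int16_t)b);
}

/* Scalar model of one vpaddlq_s16 lane pair: int16 + int16 -> int32.
 * Two maximal products sum to 32768, one past INT16_MAX, so the pairwise
 * add has to land in a 32-bit lane. */
static int32_t pairwise_add_widen_s16(int16_t p0, int16_t p1)
{
    return (int32_t)p0 + (int32_t)p1;
}

/* Dot product of n int8 pairs built from the two steps above. */
static int32_t dot_s8_model(const int8_t *a, const int8_t *b, int n)
{
    assert(n % 2 == 0);  /* the pairwise step consumes two products at a time */
    int32_t acc = 0;
    for (int i = 0; i < n; i += 2) {
        acc += pairwise_add_widen_s16(widen_mul_s8(a[i], b[i]),
                                      widen_mul_s8(a[i + 1], b[i + 1]));
    }
    return acc;
}
```

Staying in int16 after the multiply would be safe for a single product but not for the first addition, which is why the kernels widen again before accumulating.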

NEON intrinsic families used by the current CKE kernels

Load / duplicate:
  vld1q_s8(ptr)
  vdupq_n_s32(0)

Multiply and widen:
  vmull_s8(a, b)
  vmovl_s8(vget_low_s8(x))
  vmull_s16(a, b)

Accumulate:
  vaddq_s32(acc, x)
  vmlal_s16(acc, a, b)
  vpaddlq_s16(x)

Store / final reduction:
  vst1q_s32(lanes, acc)

From an ISA reviewer’s perspective, one of NEON’s underrated strengths is register count. ARM64 gives the software 32 vector registers even at 128-bit width, which is a better register story than classic SSE and one reason ARM code can stage unpacked and scaled data cleanly. That is twice the vector register count of x86-64 SSE-era XMM state (32 × 128-bit versus 16 × 128-bit), even though both tiers are 128-bit wide.

CKE kernel coverage showing NEON-optimized quantized GEMV versus scalar fallback for other kernel families.

For CKE, that universality matters more than peak width. Once a quantized kernel is written in NEON intrinsics, every AArch64 deployment tier immediately gets a concrete SIMD implementation instead of a scalar apology.

The Dispatch Pattern: How CKE Selects ARM NEON

CKE’s dispatch model is intentionally plain. There is no JIT, no opaque runtime codegen, and no mystery about which tier is active. The production pattern is compile-time #if / #elif selection. On x86 that means AVX-512, then AVX2, then AVX, then SSE. On ARM64 it means the compiler’s __aarch64__ and __ARM_NEON macros expose the NEON tier directly.

The key operational point is that NEON is in the actual dispatch chain today. For Q8_0 × Q8_0 and Q5_0 × Q8_0 it is not a dead stub. It is the active selected path on ARM builds. For Q6_K × Q8_K the NEON kernel is compiled and callable, but the public dispatch remains pinned to the scalar reference for llama.cpp-style reduction-order parity.

Q8_0 × Q8_0 dispatch tier — NEON is a live branch

void vec_dot_q8_0_q8_0(int n, float *s, const void *vx, const void *vy)
{
    const char *ref_env = getenv("CK_DEBUG_Q8_0_Q8_0_REF");
    if (ref_env && ref_env[0] && ref_env[0] != '0') {
        vec_dot_q8_0_q8_0_ref(n, s, vx, vy);
        return;
    }
#ifdef __AVX512F__
    vec_dot_q8_0_q8_0_avx512(n, s, vx, vy);
#elif defined(__AVX2__)
    vec_dot_q8_0_q8_0_avx2(n, s, vx, vy);
#elif defined(__ARM_NEON) || defined(__aarch64__)
    vec_dot_q8_0_q8_0_neon(n, s, vx, vy);
#elif defined(__AVX__)
    vec_dot_q8_0_q8_0_avx(n, s, vx, vy);
#elif defined(__SSE4_1__)
    vec_dot_q8_0_q8_0_sse(n, s, vx, vy);
#else
    vec_dot_q8_0_q8_0_ref(n, s, vx, vy);
#endif
}

Feature detection in ck_features.h — ARM hooks already exist

/* ARM */
#if defined(__aarch64__)
    #if defined(__ARM_FEATURE_SVE2)
        #define CK_HAS_SVE2 1
    #endif
    #if defined(__ARM_NEON) || defined(__ARM_FEATURE_NEON)  /* ACLE spells the NEON macro __ARM_NEON */
        #define CK_HAS_NEON 1
    #endif
#endif

/* Best available vector width */
#if defined(CK_HAS_AMX)
    #define CK_VECTOR_WIDTH 512
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX512)
    #define CK_VECTOR_WIDTH 512
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX2_FMA)
    #define CK_VECTOR_WIDTH 256
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX)
    #define CK_VECTOR_WIDTH 256
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_NEON)
    #define CK_VECTOR_WIDTH 128
    #define CK_HAS_BEST_VECTOR 1
#else
    #define CK_VECTOR_WIDTH 32  /* Scalar fallback */
    #define CK_HAS_BEST_VECTOR 0
#endif

Compile-time hierarchy discussed in this post

AVX-512
  ↓
AVX2
  ↓
ARM NEON
  ↓
AVX
  ↓
SSE4.1 / SSSE3
  ↓
Scalar reference

On AArch64:
  compiler defines __aarch64__
  include <arm_neon.h>
  select *_neon() implementation where present

This is exactly the kind of dispatch code a silicon team wants to inspect: short, deterministic, and easy to force back to reference. There is nowhere for implementation truth to hide. Cross-compile for AArch64 and the quantized GEMV inner loop immediately stops being scalar wherever a NEON branch exists. That is the entire practical meaning of “ARM support” at kernel level.
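
One way to make the selected tier observable in a test log is a tiny helper that walks the same macro chain. This is a sketch under assumptions: ck_active_tier_name is a hypothetical function, not part of CKE, and its branch order simply mirrors the Q8_0 dispatch function shown above.

```c
/* Hypothetical helper: reports which branch the compile-time chain selects
 * for this translation unit. The order mirrors vec_dot_q8_0_q8_0(). */
static const char *ck_active_tier_name(void)
{
#if defined(__AVX512F__)
    return "avx512";
#elif defined(__AVX2__)
    return "avx2";
#elif defined(__ARM_NEON) || defined(__aarch64__)
    return "neon";
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSE4_1__)
    return "sse4.1";
#else
    return "ref";
#endif
}
```

Because the answer is fixed at compile time, printing it once at startup is enough to confirm that a cross-compiled AArch64 binary really took the NEON branch.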

Compile-time dispatch hierarchy showing AVX-512 to AVX2 to ARM NEON to AVX to SSE to scalar fallback.

Just as in the SIMD deep dive, the kernel rules matter here too: no malloc, no OpenMP inside the kernel, pure computation only. That constraint is what keeps ISA dispatch reviewable instead of burying hardware decisions inside runtime orchestration noise.

Q8_0 × Q8_0: The Cleanest NEON Kernel

Of the three ARM paths, vec_dot_q8_0_q8_0_neon() is the cleanest expression of the idea. There is no nibble unpacking, no high-bit merge, and no extra scale remapping. It is just a blockwise int8 dot product: load bytes, multiply pairs into int16, pairwise widen-accumulate into int32, then collapse the four accumulator lanes into one scalar.

This is the ARM version of the same systems message the SIMD deep dive made about x86: the algorithm is not changing. The ISA surface is changing. NEON and AVX2 are two different ways of spelling the same math.

Full Q8_0 × Q8_0 NEON kernel from CKE

#if defined(__ARM_NEON) || defined(__aarch64__)
void vec_dot_q8_0_q8_0_neon(int n, float *s, const void *vx, const void *vy)
{
    const int qk = QK8_0;
    const int nb = n / qk;

    const block_q8_0 *x = (const block_q8_0 *)vx;
    const block_q8_0 *y = (const block_q8_0 *)vy;

    float sumf = 0.0f;

    for (int ib = 0; ib < nb; ib++) {
        const int8x16_t x0 = vld1q_s8(&x[ib].qs[0]);
        const int8x16_t x1 = vld1q_s8(&x[ib].qs[16]);
        const int8x16_t y0 = vld1q_s8(&y[ib].qs[0]);
        const int8x16_t y1 = vld1q_s8(&y[ib].qs[16]);

        int32x4_t acc = vdupq_n_s32(0);

        const int16x8_t p0 = vmull_s8(vget_low_s8(x0), vget_low_s8(y0));
        const int16x8_t p1 = vmull_s8(vget_high_s8(x0), vget_high_s8(y0));
        const int16x8_t p2 = vmull_s8(vget_low_s8(x1), vget_low_s8(y1));
        const int16x8_t p3 = vmull_s8(vget_high_s8(x1), vget_high_s8(y1));

        acc = vaddq_s32(acc, vpaddlq_s16(p0));
        acc = vaddq_s32(acc, vpaddlq_s16(p1));
        acc = vaddq_s32(acc, vpaddlq_s16(p2));
        acc = vaddq_s32(acc, vpaddlq_s16(p3));

        int32_t lanes[4];
        vst1q_s32(lanes, acc);
        const int sumi = lanes[0] + lanes[1] + lanes[2] + lanes[3];

        sumf += (float)sumi * (CK_FP16_TO_FP32(x[ib].d) * CK_FP16_TO_FP32(y[ib].d));
    }

    *s = sumf;
}
#endif

Q8_0 NEON data flow, step by step

Load phase:
  vld1q_s8(&x[ib].qs[0])   -> 16 weight bytes
  vld1q_s8(&y[ib].qs[0])   -> 16 activation bytes

Multiply phase:
  vmull_s8(low8, low8)     -> 8 int16 products
  vmull_s8(high8, high8)   -> 8 int16 products

Pairwise accumulate:
  vpaddlq_s16(p0)          -> 4 int32 partial sums
  vpaddlq_s16(p1)          -> 4 int32 partial sums

Block reduction:
  vaddq_s32(acc, partial)
  vst1q_s32(lanes, acc)
  lanes[0] + lanes[1] + lanes[2] + lanes[3]
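
For comparison, here is what the same data flow computes when written as plain scalar C. This is a simplified sketch, not the CKE reference kernel: demo_block_q8_0 stores the scale as a float, whereas the real block_q8_0 stores it as FP16, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_QK8_0 32  /* values per block, matching the two 16-byte loads */

/* Simplified stand-in for block_q8_0 (the real block stores d as FP16). */
typedef struct {
    float  d;                /* per-block scale */
    int8_t qs[DEMO_QK8_0];   /* quantized values */
} demo_block_q8_0;

static float demo_vec_dot_q8_0_q8_0_ref(int n,
                                        const demo_block_q8_0 *x,
                                        const demo_block_q8_0 *y)
{
    assert(n % DEMO_QK8_0 == 0);
    const int nb = n / DEMO_QK8_0;
    float sumf = 0.0f;

    for (int ib = 0; ib < nb; ib++) {
        int32_t sumi = 0;  /* exact integer dot product for this block */
        for (int j = 0; j < DEMO_QK8_0; j++) {
            sumi += (int32_t)x[ib].qs[j] * (int32_t)y[ib].qs[j];
        }
        /* One float multiply per block applies both scales. */
        sumf += (float)sumi * (x[ib].d * y[ib].d);
    }
    return sumf;
}
```

The integer part is exact in both versions, so the NEON kernel and this loop should agree on sumi for every block; any difference can only come from how the per-block floats are combined.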

AVX2 path for the same kernel — same algorithm, different ISA vocabulary

for (; ib < nb; ++ib) {
    const __m256 d = _mm256_set1_ps(CK_FP16_TO_FP32(x[ib].d) * CK_FP16_TO_FP32(y[ib].d));
    const __m256i qx = _mm256_loadu_si256((const __m256i *)x[ib].qs);
    const __m256i qy = _mm256_loadu_si256((const __m256i *)y[ib].qs);
    const __m256 q = mul_sum_i8_pairs_float_q8_0_avx2(qx, qy);
#if defined(__FMA__)
    acc = _mm256_fmadd_ps(d, q, acc);
#else
    acc = _mm256_add_ps(_mm256_mul_ps(d, q), acc);
#endif
}

sumf = hsum_float_8_q8_0(acc);

Intrinsic	What it does in the NEON path	Why it matters
`vld1q_s8`	Loads 16 signed int8 lanes.	Moves packed quantized values directly into vector state.
`vmull_s8`	Multiplies 8 int8 lanes and widens to int16.	Avoids scalar unpack loops.
`vpaddlq_s16`	Pairwise add and widen int16 to int32.	Collapses 8 products into 4 safer accumulator lanes.
`vaddq_s32`	Adds int32 vectors.	Maintains block accumulation in-register.
`vst1q_s32`	Stores the final four accumulator lanes.	Enables the manual horizontal sum CKE uses today.

There is no hand-waving here. The kernel is an explicit int8×int8 dot product expressed in NEON intrinsics. If an ARM reviewer wants to understand CKE’s current ARM competence in one screenful, this is the screenful to start with. The x86 SIMD deep dive showed the same idea on Intel and AMD instruction sets. The ARM takeaway is not that NEON equals AVX2 in width. It is that CKE knows how to map the same quantized contract into ARM’s own widening-and-reduction idioms.

Q8_0 NEON kernel data flow from int8 loads through vmull, vpaddl, and manual lane reduction.

Q5_0 × Q8_0: Bit-Unpacking Plus NEON

Q5_0 × Q8_0 adds the complication that makes quantized CPU kernels feel like real systems work rather than textbook linear algebra: the weights are not stored as straightforward signed bytes. They are stored as 5-bit values split between low nibbles and a separate high-bit field. The NEON compute core is still elegant, but there is scalar unpack work before the vector multiply starts.

This is exactly why ISA reviewers should care about file-level kernel competence. Real quantized inference is not a single pretty dot instruction. It is layout decode plus arithmetic plus reduction plus parity policy.

Horizontal-sum helper reused by the Q5_0 NEON path

#if defined(__ARM_NEON) || defined(__aarch64__)
static inline int32_t ck_hsum_s32x4(int32x4_t v)
{
    int32_t lanes[4];
    vst1q_s32(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
#endif

Q5_0 unpack loop — low nibble plus high-bit field become signed int8 values

uint32_t qh;
memcpy(&qh, x[ib].qh, sizeof(qh));

int8_t wvals[QK5_0];
for (int j = 0; j < qk / 2; j++) {
    const uint8_t packed = x[ib].qs[j];
    const uint8_t xh_0 = ((qh >> (j + 0)) & 1u) << 4;
    const uint8_t xh_1 = ((qh >> (j + 16)) & 1u) << 4;

    wvals[j] = (int8_t)(((packed & 0x0F) | xh_0) - 16);
    wvals[j + qk / 2] = (int8_t)(((packed >> 4) | xh_1) - 16);
}
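
A single-byte worked example helps when reading the unpack loop. The helper below is illustrative only (demo_q5_0_unpack_pair is not a CKE function); it applies the same expressions to one packed byte and returns the two signed values it contributes.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper: decodes element j and element j + 16 of a 32-value
 * Q5_0 block from one packed byte plus the 32-bit high-bit field. */
static void demo_q5_0_unpack_pair(uint8_t packed, uint32_t qh, int j,
                                  int8_t *lo_out, int8_t *hi_out)
{
    assert(j >= 0 && j < 16);
    /* Bit j of qh is the fifth bit of element j; bit j + 16 belongs to
     * element j + 16. Each is moved into bit position 4. */
    const uint8_t xh_0 = (uint8_t)(((qh >> (j + 0)) & 1u) << 4);
    const uint8_t xh_1 = (uint8_t)(((qh >> (j + 16)) & 1u) << 4);

    /* 5-bit unsigned value 0..31, recentred to the signed range -16..15. */
    *lo_out = (int8_t)(((packed & 0x0F) | xh_0) - 16);
    *hi_out = (int8_t)(((packed >> 4) | xh_1) - 16);
}
```

With packed = 0xA7, j = 0, and only bit 16 of qh set, the low nibble 7 has no high bit and decodes to 7 - 16 = -9, while the high nibble 10 gains the fifth bit and decodes to 26 - 16 = 10.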

Full Q5_0 × Q8_0 NEON kernel from CKE

void vec_dot_q5_0_q8_0_neon(int n, float *s, const void *vx, const void *vy)
{
    const int qk = QK5_0;
    const int nb = n / qk;

    const block_q5_0 *x = (const block_q5_0 *)vx;
    const block_q8_0 *y = (const block_q8_0 *)vy;

    float sumf = 0.0f;

    for (int ib = 0; ib < nb; ib++) {
        uint32_t qh;
        memcpy(&qh, x[ib].qh, sizeof(qh));

        int8_t wvals[QK5_0];
        for (int j = 0; j < qk / 2; j++) {
            const uint8_t packed = x[ib].qs[j];
            const uint8_t xh_0 = ((qh >> (j + 0)) & 1u) << 4;
            const uint8_t xh_1 = ((qh >> (j + 16)) & 1u) << 4;

            wvals[j] = (int8_t)(((packed & 0x0F) | xh_0) - 16);
            wvals[j + qk / 2] = (int8_t)(((packed >> 4) | xh_1) - 16);
        }

        const int8x16_t w0 = vld1q_s8(&wvals[0]);
        const int8x16_t w1 = vld1q_s8(&wvals[16]);
        const int8x16_t x0 = vld1q_s8(&y[ib].qs[0]);
        const int8x16_t x1 = vld1q_s8(&y[ib].qs[16]);

        int32x4_t acc = vdupq_n_s32(0);

        int16x8_t p0 = vmull_s8(vget_low_s8(w0), vget_low_s8(x0));
        int16x8_t p1 = vmull_s8(vget_high_s8(w0), vget_high_s8(x0));
        int16x8_t p2 = vmull_s8(vget_low_s8(w1), vget_low_s8(x1));
        int16x8_t p3 = vmull_s8(vget_high_s8(w1), vget_high_s8(x1));

        acc = vaddq_s32(acc, vpaddlq_s16(p0));
        acc = vaddq_s32(acc, vpaddlq_s16(p1));
        acc = vaddq_s32(acc, vpaddlq_s16(p2));
        acc = vaddq_s32(acc, vpaddlq_s16(p3));

        const float d = CK_FP16_TO_FP32(x[ib].d) * CK_FP16_TO_FP32(y[ib].d);
        sumf += d * (float)ck_hsum_s32x4(acc);
    }

    *s = sumf;
}

Q5_0 dispatch tier — NEON is active on ARM

void vec_dot_q5_0_q8_0(int n, float *s, const void *vx, const void *vy)
{
#if defined(__AVX512F__)
    vec_dot_q5_0_q8_0_avx512(n, s, vx, vy);
#elif defined(__AVX2__)
    vec_dot_q5_0_q8_0_avx2(n, s, vx, vy);
#elif defined(__ARM_NEON) || defined(__aarch64__)
    vec_dot_q5_0_q8_0_neon(n, s, vx, vy);
#elif defined(__AVX__)
    vec_dot_q5_0_q8_0_avx(n, s, vx, vy);
#elif defined(__SSSE3__)
    vec_dot_q5_0_q8_0_sse(n, s, vx, vy);
#else
    vec_dot_q5_0_q8_0_ref(n, s, vx, vy);
#endif
}

The key difference from Q8_0 is representational overhead. Q5_0 requires unpack work because 5-bit weights do not line up cleanly on byte boundaries, so the kernel first reconstructs signed int8 values and then hands the multiply to NEON. This is why quantized kernel engineering is half data-layout archaeology and half SIMD arithmetic.

Memory bandwidth comparison across Alder Lake, Xeon, ARM AGI CPU, and Graviton4 platforms.

The compute sequence after unpack is familiar: vld1q_s8 → vmull_s8 → vpaddlq_s16 → vaddq_s32. That repetition is valuable. It means once the format-specific decode is done, the arithmetic surface returns to a stable ARM pattern.

Q6_K × Q8_K: The Most Complex NEON Kernel

Q6_K × Q8_K is the most important ARM kernel in this post because it is closer to large-model decode reality. The Q6_K format packs 256 weights per block, splits each 6-bit value across low and high fields, and applies per-sub-block scales. In other words, it looks much more like the actual quantized inference work a CPU stack must do for modern LLMs.

The NEON path here is more elaborate than the clean Q8_0 kernel because it performs a three-stage pipeline: reconstruct the 6-bit weights, align sub-block scales, and then do the widened multiply-accumulate sequence against the Q8 activations.

Q6_K file comment — packed-format contract

/**
 * Q6_K Format (256 weights per block):
 *   - d: FP16 super-block scale
 *   - ql: 128 bytes (low 4 bits of each weight)
 *   - qh: 64 bytes (high 2 bits of each weight)
 *   - scales: 16 int8 sub-block scales
 *
 * Q8_K Format (256 weights per block):
 *   - d: FP32 scale
 *   - qs: 256 int8 values
 *   - bsums: 16 int16 block sums
 */
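
The layout comment implies a fixed storage cost per weight, which a small struct makes checkable. This is a sketch under assumptions: demo_block_q6_K mirrors the fields listed above, but the field order and the uint16_t stand-in for the FP16 scale are guesses made for illustration, not the CKE definition.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_QK_K 256

/* Illustrative mirror of the Q6_K layout in the comment above. */
typedef struct {
    uint8_t  ql[DEMO_QK_K / 2];       /* 128 bytes: low 4 bits of each weight */
    uint8_t  qh[DEMO_QK_K / 4];       /*  64 bytes: high 2 bits of each weight */
    int8_t   scales[DEMO_QK_K / 16];  /*  16 int8 sub-block scales */
    uint16_t d;                       /* FP16 super-block scale, raw bits */
} demo_block_q6_K;

/* Rebuilds one signed weight from its 4-bit low part and 2-bit high part:
 * unsigned 0..63, recentred to -32..31. */
static int8_t demo_q6_k_decode(uint8_t low4, uint8_t high2)
{
    assert(low4 < 16 && high2 < 4);
    return (int8_t)((low4 | (high2 << 4)) - 32);
}

/* Storage cost including scales: 210 bytes for 256 weights. */
static double demo_q6_k_bits_per_weight(void)
{
    return 8.0 * (double)sizeof(demo_block_q6_K) / (double)DEMO_QK_K;
}
```

The block adds up to 128 + 64 + 16 + 2 = 210 bytes, so the effective cost is 6.5625 bits per weight rather than a flat 6, and every one of those bytes has to cross the memory bus once per decoded token.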

Q6_K × Q8_K NEON kernel from CKE

#if defined(__ARM_NEON) || defined(__aarch64__)
static float dot_q6_k_q8_k_neon(const block_q6_K *w,
                                const block_q8_K *x,
                                int K)
{
    const int nb = K / QK_K;
    float sumf = 0.0f;

    for (int i = 0; i < nb; ++i) {
        const float d = GGML_FP16_TO_FP32(w[i].d) * x[i].d;

        const uint8_t *ql = w[i].ql;
        const uint8_t *qh = w[i].qh;
        const int8_t *sc = w[i].scales;
        const int8_t *q8 = x[i].qs;

        int8_t wvals[QK_K];
        int8_t svals[QK_K];

        for (int n = 0; n < QK_K; n += 128) {
            for (int l = 0; l < 32; ++l) {
                const int is = l / 16;

                const int8_t q1 = (int8_t)((ql[l + 0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
                const int8_t q2 = (int8_t)((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
                const int8_t q3 = (int8_t)((ql[l + 0] >> 4) | (((qh[l] >> 4) & 3) << 4)) - 32;
                const int8_t q4 = (int8_t)((ql[l + 32] >> 4) | (((qh[l] >> 6) & 3) << 4)) - 32;

                const int base = n;
                wvals[base + l + 0] = q1;
                wvals[base + l + 32] = q2;
                wvals[base + l + 64] = q3;
                wvals[base + l + 96] = q4;

                svals[base + l + 0] = sc[is + 0];
                svals[base + l + 32] = sc[is + 2];
                svals[base + l + 64] = sc[is + 4];
                svals[base + l + 96] = sc[is + 6];
            }

            ql += 64;
            qh += 32;
            sc += 8;
        }

        int32x4_t acc = vdupq_n_s32(0);
        for (int j = 0; j < QK_K; j += 16) {
            const int8x16_t wv = vld1q_s8(&wvals[j]);
            const int8x16_t sv = vld1q_s8(&svals[j]);
            const int8x16_t xv = vld1q_s8(&q8[j]);

            const int16x8_t ws0 = vmull_s8(vget_low_s8(wv), vget_low_s8(sv));
            const int16x8_t ws1 = vmull_s8(vget_high_s8(wv), vget_high_s8(sv));
            const int16x8_t x0 = vmovl_s8(vget_low_s8(xv));
            const int16x8_t x1 = vmovl_s8(vget_high_s8(xv));

            int32x4_t p0 = vmull_s16(vget_low_s16(ws0), vget_low_s16(x0));
            p0 = vmlal_s16(p0, vget_high_s16(ws0), vget_high_s16(x0));

            int32x4_t p1 = vmull_s16(vget_low_s16(ws1), vget_low_s16(x1));
            p1 = vmlal_s16(p1, vget_high_s16(ws1), vget_high_s16(x1));

            acc = vaddq_s32(acc, p0);
            acc = vaddq_s32(acc, p1);
        }

        int32_t lanes[4];
        vst1q_s32(lanes, acc);
        sumf += d * (float)(lanes[0] + lanes[1] + lanes[2] + lanes[3]);
    }

    return sumf;
}
#endif

Public Q6_K × Q8_K dispatch — scalar by default for parity, NEON still compiled

void vec_dot_q6_k_q8_k(int n, float *s, const void *vx, const void *vy)
{
    if (!s || !vx || !vy || n <= 0) {
        return;
    }

    const block_q6_K *x = (const block_q6_K *)vx;
    const block_q8_K *y = (const block_q8_K *)vy;

    /* Keep the public dispatch on the llama-compatible reduction order.
     * The SIMD variants are still exported for direct tests/benchmarks, but
     * their horizontal reduction order can move borderline logits in Qwen3.5
     * long-decode parity. */
    *s = dot_q6_k_q8_k_ref(x, y, n);
}

How the Q6_K NEON path differs from the Q8_0 path

Q8_0 path:
  int8 load
  vmull_s8
  vpaddlq_s16
  vaddq_s32
  store lanes + scalar sum

Q6_K path:
  unpack ql + qh into 6-bit signed bytes
  map sub-block scales into svals[]
  vmull_s8(weight, scale)
  vmovl_s8(activation)
  vmull_s16 / vmlal_s16
  vaddq_s32
  store lanes + scalar sum

This is the kernel that makes the TDA4VM story matter. It is not just a toy byte-dot loop. It is a production-style packed-format decode kernel with explicit NEON arithmetic in the hottest part of inference. The public entry point stays on the scalar reduction order for parity, but that does not erase the engineering signal. The NEON kernel exists, compiles, and is available for direct tests and benchmarks.
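
The parity concern is easier to accept with a concrete case of float addition not being associative. The snippet below is a generic illustration, not a reproduction of the Q6_K discrepancy; demo_sum3_in_order is a hypothetical helper.

```c
/* Sums three floats strictly left to right. The volatile temporary forces
 * each partial sum to be rounded to float, so the compiler cannot
 * reassociate the additions or keep extra precision between them. */
static float demo_sum3_in_order(float a, float b, float c)
{
    volatile float t = a + b;
    t = t + c;
    return t;
}
```

Summing 1e8f, 1.0f, -1e8f in that order loses the 1.0f, because neighbouring floats near 1e8 are 8 apart; summing 1e8f, -1e8f, 1.0f keeps it. A SIMD kernel that combines lanes in a different order than the reference can shift the last bits of a logit in the same way, which is enough to flip a near-tie during a long decode.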

Scaling from TDA4VM dual A72 to Graviton4 to ARM AGI CPU with 136 Neoverse V3 cores.

The important design choice is that the algorithm stays visible. Nothing about this kernel is opaque to hardware review. Every unpack, widen, multiply, and reduction stage is spelled in ordinary C plus NEON intrinsics.

The Horizontal Reduction Problem

All three ARM kernels in this post eventually hit the same awkward point: SIMD vectors are great at lane-local work, but the kernel finally needs one scalar answer. Today CKE handles that with the most portable approach possible on NEON: store the accumulator lanes to memory and add them in scalar code.

This is not a design flaw. It is a revealing ISA detail. In the x86 SIMD discussion, AVX-512 made reductions much cleaner than AVX1. ARM has a similar story: 32-bit ARMv7 NEON needs a manual lane collapse, while AArch64 Advanced SIMD, including the Cortex-A72, provides across-vector intrinsics such as vaddvq_s32 that make the final reduction tighter.

Manual horizontal reduction in the current NEON kernels

int32_t lanes[4];
vst1q_s32(lanes, acc);
const int sumi = lanes[0] + lanes[1] + lanes[2] + lanes[3];

AVX1 helper — x86 had the same problem before AVX-512

#if defined(__AVX__) && !defined(__AVX512F__)
static inline float hsum256_ps_rmsnorm(__m256 v) {
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 sum128 = _mm_add_ps(lo, hi);
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    return _mm_cvtss_f32(sum128);
}
#endif

AVX-512 reduction ergonomics from the SIMD deep dive

__m512 sum_sq_vec = _mm512_setzero_ps();
for (; d + 16 <= D; d += 16) {
    __m512 xv = _mm512_loadu_ps(&x[d]);
    sum_sq_vec = _mm512_fmadd_ps(xv, xv, sum_sq_vec);
}
float sum_sq = _mm512_reduce_add_ps(sum_sq_vec);

A small NEON cleanup opportunity on AArch64 builds

#if defined(__aarch64__)  /* vaddvq_s32 is an A64-only intrinsic; no ARMv8.1 feature bit is needed */
static inline int32_t ck_hsum_s32x4_fast(int32x4_t v)
{
    return vaddvq_s32(v);
}
#endif

Tier	Reduction method	Consequence
Current CKE NEON	`vst1q_s32` then scalar lane add	Portable and correct, but a little clunky.
AVX1 / AVX2	Custom helper with extracts and horizontal adds	Similar manual choreography.
AVX-512	`_mm512_reduce_add_ps`	One intrinsic call; the compiler expands it into a short shuffle-and-add sequence.
AArch64 option	`vaddvq_s32`	Cleaner NEON reduction (a single ADDV instruction) without changing kernel math.

Reduction is one of the most telling places to compare ISAs because it exposes whether the hardware sees ML-style accumulations as first-class patterns or as something software still has to stitch together by hand. That is the practical appeal of AVX-512 reduce-add and AArch64's across-vector add (a single ADDV instruction behind vaddvq_s32): less choreography in the hottest collapse step.

What Is Not NEON-Optimized Yet

The honest ARM story is not “CKE is fully optimized on ARM.” The honest story is much more interesting: ARM already has three real quantized kernels, but roughly eighty other kernel files still fall back to scalar on ARM even when x86 has SIMD specializations. That is exactly the kind of partially-complete but clearly-real state that tells a silicon team where joint work could matter.

The first order bit is coverage. The second is priority. Not every kernel deserves equal attention. Decode-critical normalization, attention, activation, and RoPE kernels should be ported before long-tail utilities.

Inventory slice	Count	Meaning
Total kernel `.c` files	83	CKE is already a broad kernel repository, not a single-demo codebase.
Files with x86 SIMD intrinsics	49	x86 path is already materially ISA-specialized.
Files with ARM NEON intrinsics	3	ARM coverage exists today, but only for quantized GEMV/dot.
Files still scalar on ARM	≈80	Main opportunity for NEON and SVE2 expansion.

Priority	Kernel family	Why ARM should care
1	`rmsnorm_kernels.c`	Every layer touches it; current ARM fallback is scalar.
2	`attention_kernels.c`	258 x86 intrinsics already exist; decode latency depends on it.
3	`swiglu_kernels.c` / `gelu_kernels.c`	MLP activation bandwidth and FMA density matter per token.
4	`rope_kernels.c`	Position embedding math is regular and mechanically portable.
5	`softmax_kernels.c`	Reduction-heavy and numerically sensitive, but still formulaically straightforward.

What an ARM NEON RMSNorm port would look like

#include <arm_neon.h>
#include <math.h>   /* sqrtf */

void rmsnorm_forward_neon(const float *x,
                          const float *gamma,
                          float *y,
                          int D,
                          float eps)
{
    float32x4_t sum_sq_v = vdupq_n_f32(0.0f);
    int d = 0;
    for (; d + 4 <= D; d += 4) {
        float32x4_t xv = vld1q_f32(&x[d]);
        sum_sq_v = vfmaq_f32(sum_sq_v, xv, xv);
    }
    float sum_sq = vaddvq_f32(sum_sq_v);
    for (; d < D; ++d) {
        sum_sq += x[d] * x[d];
    }

    const float rstd = 1.0f / sqrtf(sum_sq / (float)D + eps);
    float32x4_t rstd_v = vdupq_n_f32(rstd);

    d = 0;
    for (; d + 4 <= D; d += 4) {
        float32x4_t xv = vld1q_f32(&x[d]);
        float32x4_t gv = vld1q_f32(&gamma[d]);
        float32x4_t yv = vmulq_f32(vmulq_f32(xv, rstd_v), gv);
        vst1q_f32(&y[d], yv);
    }
    for (; d < D; ++d) {
        y[d] = x[d] * rstd * gamma[d];
    }
}

Attention decode porting sketch — mechanical, not algorithmic

for (int h = 0; h < num_heads; ++h) {
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    for (int d = 0; d + 4 <= head_dim; d += 4) {
        float32x4_t qv = vld1q_f32(&q[h * head_dim + d]);
        float32x4_t kv = vld1q_f32(&k_cache[token * head_dim + d]);
        acc0 = vfmaq_f32(acc0, qv, kv);
    }
    scores[h] = vaddvq_f32(acc0) * inv_sqrt_d;
}

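RoPE and SwiGLU follow the same translation pattern

// RoPE sketch
// rot holds each element's rotation partner. How it is gathered depends on
// the RoPE layout; the load below assumes a split-half layout.
float32x4_t xv = vld1q_f32(x + d);
float32x4_t rot = vld1q_f32(x + d + half);  // assumption: partner sits half a head away
float32x4_t cv = vld1q_f32(cos_t + d);
float32x4_t sv = vld1q_f32(sin_t + d);
float32x4_t y0 = vfmsq_f32(vmulq_f32(xv, cv), rot, sv);  // x*cos - rot*sin

// SwiGLU sketch
// sigmoid4_neon is a placeholder for a vectorized sigmoid, not an arm_neon.h intrinsic.
float32x4_t a = vld1q_f32(gate + d);
float32x4_t b = vld1q_f32(up + d);
float32x4_t sig = sigmoid4_neon(a);
float32x4_t out = vmulq_f32(vmulq_f32(a, sig), b);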

The ARM opportunity is therefore not “invent new math.” It is port existing correct math into ARM intrinsics, starting with the kernels that sit in every decode step. That is why the remaining work is best described as mechanical rather than algorithmic. Scalar correctness already exists. The task is to change the execution surface.
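
Porting stays mechanical only if every new NEON kernel is checked against the scalar output it replaces. A minimal comparison helper might look like the following. The demo_ names are hypothetical, and the tolerance is left to the caller because the right value depends on the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute element-wise difference between a reference output and
 * a SIMD output. NaN differences are not detected by this simple loop. */
static float demo_max_abs_diff(const float *ref, const float *simd, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float diff = ref[i] - simd[i];
        if (diff < 0.0f) {
            diff = -diff;
        }
        if (diff > worst) {
            worst = diff;
        }
    }
    return worst;
}

/* Returns 1 when every element is within tol of the reference. */
static int demo_outputs_match(const float *ref, const float *simd,
                              size_t n, float tol)
{
    assert(tol >= 0.0f);
    return demo_max_abs_diff(ref, simd, n) <= tol;
}
```

Running the scalar kernel and the NEON kernel on the same input and passing both outputs through a check like this turns "the port is correct" into something a build can assert.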

SVE2: What ARM AGI CPU Unlocks

If NEON is the first ARM proof point, SVE2 is the architecture-scale destination. The jump is not just wider vectors. It is scalable vector length, predication, and native support for patterns that look much more like modern ML loops. That matters because the target hardware under discussion is not a phone core. It is ARM’s AGI CPU class, built around Neoverse V3: 136 cores, SVE2, native bfloat16, 12 DDR5 channels at 8800 MT/s, and CXL 3.0 for memory expansion.

For CKE, SVE2 means one source-level vector algorithm can become vector-length agnostic. The same binary can run on 128-bit SVE implementations and on much wider ones without rewriting the cleanup loop for each width. That is a big deal for normalization, activation, and decode kernels whose bodies are structurally regular.

ARM AGI CPU trait	Why it matters to CKE
136 Neoverse V3 cores	Token decode is embarrassingly parallel across sessions even when a single stream is bandwidth-bound.
SVE2	Predicated vector loops remove a lot of manual tail handling.
Native bfloat16	Opens a serious path for bf16 kernels beyond fp32 scalar fallback.
12× DDR5 8800 MT/s	Exactly the resource memory-bound GEMV wants.
CXL 3.0	Multi-terabyte memory scaling aligns with CKE’s CPU-first thesis.
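
The DDR5 row can be turned into a number with simple arithmetic. The sketch below assumes a 64-bit (8-byte) data path per channel and uses theoretical peak figures only; the function names are hypothetical, and the 40 GB model size in the note that follows is an arbitrary example, not a measured CKE workload.

```c
#include <assert.h>

/* Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).
 * bytes_per_transfer = 8 assumes a 64-bit data path per channel. */
static double demo_peak_bandwidth_gbs(int channels,
                                      double mega_transfers_per_s,
                                      int bytes_per_transfer)
{
    assert(channels > 0 && bytes_per_transfer > 0);
    return (double)channels * mega_transfers_per_s * 1e6
           * (double)bytes_per_transfer / 1e9;
}

/* Upper bound on batch-1 decode rate if every generated token has to
 * stream all model weights from DRAM exactly once. */
static double demo_decode_tokens_per_s_bound(double bandwidth_gbs,
                                             double model_gb)
{
    assert(model_gb > 0.0);
    return bandwidth_gbs / model_gb;
}
```

Under those assumptions, 12 channels at 8800 MT/s give 12 × 8800 × 8 = 844.8 GB/s of peak bandwidth, and a hypothetical 40 GB quantized model would be capped at roughly 21 tokens per second per stream no matter how wide the vector units are.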

ck_features.h already knows about SVE2

#if defined(__aarch64__)
    #if defined(__ARM_FEATURE_SVE2)
        #define CK_HAS_SVE2 1
    #endif
    #if defined(__ARM_NEON) || defined(__ARM_FEATURE_NEON)  /* ACLE spells the NEON macro __ARM_NEON */
        #define CK_HAS_NEON 1
    #endif
#endif

Hypothetical SVE2 RMSNorm kernel — same math, scalable vectors

#include <arm_sve.h>
#include <math.h>   /* sqrtf */

void rmsnorm_forward_sve2(const float *x,
                          const float *gamma,
                          float *y,
                          int D,
                          float eps)
{
    svfloat32_t sum_sq = svdup_f32(0.0f);
    int d = 0;
    while (d < D) {
        svbool_t pg = svwhilelt_b32(d, D);
        svfloat32_t xv = svld1_f32(pg, &x[d]);
        sum_sq = svfmla_f32_z(pg, sum_sq, xv, xv);
        d += svcntw();
    }

    float total = svaddv_f32(svptrue_b32(), sum_sq);
    const float rstd = 1.0f / sqrtf(total / (float)D + eps);
    svfloat32_t rstd_v = svdup_f32(rstd);

    d = 0;
    while (d < D) {
        svbool_t pg = svwhilelt_b32(d, D);
        svfloat32_t xv = svld1_f32(pg, &x[d]);
        svfloat32_t gv = svld1_f32(pg, &gamma[d]);
        svfloat32_t yv = svmul_f32_z(pg, svmul_f32_z(pg, xv, rstd_v), gv);
        svst1_f32(pg, &y[d], yv);
        d += svcntw();
    }
}

Why SVE2 feels different in practice

Predication primitives:
  svptrue_b32()       -> all lanes active
  svwhilelt_b32(i, D) -> tail predicate, no scalar cleanup loop

Predicated memory:
  svld1_f32(pg, ptr)
  svst1_f32(pg, ptr)

Predicated arithmetic:
  svmla_f32_m(pg, acc, a, b)   -> acc + a*b, inactive lanes keep acc
  svmul_f32_z(pg, a, b)

Bfloat16 opportunity:
  svbfmmla(...)       -> bf16 matrix multiply-accumulate on supporting cores
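To make the predication idea concrete without SVE hardware, here is a minimal scalar model of a vector-length-agnostic sum of squares. It is only a sketch: `sum_sq_vla_model` and `CK_MAX_LANES` are hypothetical names, `lanes` stands in for what `svcntw()` would report, and the per-lane `active` flag stands in for the predicate from `svwhilelt_b32`. The point it illustrates is that the same loop gives the same answer at any vector width, with no separate cleanup loop.

```c
/* Hypothetical upper bound: a 2048-bit vector holds 64 fp32 lanes. */
#define CK_MAX_LANES 64

/* Scalar model of a predicated, vector-length-agnostic sum of squares.
 * Returns 0.0f if `lanes` is outside [1, CK_MAX_LANES]. */
float sum_sq_vla_model(const float *x, int D, int lanes)
{
    if (lanes < 1 || lanes > CK_MAX_LANES) {
        return 0.0f;
    }

    float acc[CK_MAX_LANES] = {0.0f};          /* one partial sum per lane */

    for (int d = 0; d < D; d += lanes) {       /* step by the vector length */
        for (int l = 0; l < lanes; ++l) {
            int active = (d + l) < D;          /* whilelt-style tail predicate */
            if (active) {
                acc[l] += x[d + l] * x[d + l];
            }
            /* Inactive lanes are left untouched (merging behaviour), so
             * partial sums from earlier iterations survive the tail. */
        }
    }

    float total = 0.0f;                        /* horizontal reduction */
    for (int l = 0; l < lanes; ++l) {
        total += acc[l];
    }
    return total;
}
```

Calling it with `lanes` set to 4, 16, or 64 on the same input models running one binary on 128-bit, 512-bit, and 2048-bit implementations.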

bf16-oriented SVE2 sketch for future CKE kernels

svbool_t pg = svptrue_b16();
svbfloat16_t av = svld1_bf16(pg, a_ptr);
svbfloat16_t bv = svld1_bf16(pg, b_ptr);
svfloat32_t acc = svdup_f32(0.0f);
acc = svbfmmla_f32(acc, av, bv);
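For readers less familiar with the format, bfloat16 is the upper 16 bits of an IEEE fp32 value: same sign and 8-bit exponent, 7 mantissa bits instead of 23. The following portable sketch shows a round-to-nearest-even conversion in plain C. The helper names `ck_fp32_to_bf16` and `ck_bf16_to_fp32` are hypothetical and not taken from the CKE repository; hardware conversion instructions may treat NaNs and rounding modes differently.

```c
#include <stdint.h>
#include <string.h>

/* fp32 -> bf16 with round-to-nearest-even (illustrative helper). */
uint16_t ck_fp32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* type-pun without UB */

    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: keep the sign and exponent, force a quiet-NaN mantissa bit. */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }

    /* Add 0x7FFF plus the lowest kept bit so exact ties round to even. */
    uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding_bias) >> 16);
}

/* bf16 -> fp32 is exact: the 16 bits become the top half of the float. */
float ck_bf16_to_fp32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

A scalar pair like this is mainly useful as a reference when checking a vectorized bf16 kernel against fp32 results.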

The ARM AGI CPU target is compelling precisely because it aligns ISA evolution with memory-system scale. SVE2 gives cleaner vector semantics; Neoverse V3 gives the core count and DDR bandwidth that decode workloads actually consume. With 136 cores in the target Neoverse V3 class, there is enough parallel host capacity to turn CKE from “it runs” into “it scales.”

Heatmap of NEON intrinsics used across Q8_0, Q5_0, and Q6_K kernels.

In other words, NEON proves CKE already speaks ARM. SVE2 is what would let it speak ARM at flagship server scale.

The TDA4VM Story: Where This Already Works

The TI TDA4VM matters because it strips the conversation of excuses. This is not a giant 500-watt server CPU. It is a dual-core Cortex-A72 SoC from a prior generation, with NEON and a constrained memory budget. Yet Qwen2 and Qwen3 were run on it through CKE’s quantized path.

That means the whole pipeline closed: model path selection, quantized weights, generated C kernels, ARM compile, and actual inference. Most importantly, it means the quantized GEMV inner loop was not a scalar-only placeholder. The NEON kernels discussed above were in the relevant thermal and latency path.

Platform	CPU profile	What it proves
TI TDA4VM	2× Cortex-A72, ARMv8.0, NEON, limited RAM	CKE already runs real quantized Qwen inference on ARM.
ARM AGI CPU / Neoverse V3	136 cores, SVE2, bf16, 12-channel DDR5, CXL	Natural next target for scaled CPU-only deployment.

The pipeline that has already been exercised on ARM

model definition
  -> template / IR selection
  -> kernel code generation
  -> C source with quantized kernels
  -> AArch64 compile
  -> deploy on TI TDA4VM
  -> run Qwen2 / Qwen3 inference

What an ARM evaluation build matrix can look like

# Current proof tier
clang -O3 -march=armv8-a+simd -ffast-math -c gemm_kernels_q8_0.c

# Neoverse-class NEON tier
clang -O3 -mcpu=neoverse-v2 -c gemm_kernels_q5_0.c

# Future SVE2 tier
clang -O3 -march=armv9-a+sve2+bf16 -c rmsnorm_kernels.c

A dual-core 2017-era ARM SoC already ran real Qwen models. That is the fact ARM should keep in mind before anyone says CPU inference on ARM is speculative. If the proof point already exists on Cortex-A72, the real question is what 136 Neoverse V3 cores plus SVE2 plus multi-terabyte DDR5 unlock.

Compile-Time vs Runtime Dispatch

CKE today is mostly a compile-time dispatch codebase. That means separate binaries can be produced for different ISA tiers without paying runtime branch cost inside the hot kernel entry. For silicon evaluation this is often ideal: build one binary for NEON, another for SVE2, and compare them cleanly.

There is, of course, a runtime-dispatch alternative. ARM Linux exposes hardware capability bits through getauxval(AT_HWCAP) and AT_HWCAP2. A single binary can inspect those flags and choose between scalar, NEON, SVE, and SVE2 paths. That is more flexible for broad distribution, but it complicates benchmarking and leaves an extra branch surface in the dispatch layer.

Compile-time dispatch philosophy

#if defined(__ARM_FEATURE_SVE2)
    attention_decode_sve2(...);
#elif defined(__ARM_NEON) || defined(__aarch64__)
    attention_decode_neon(...);
#else
    attention_decode_ref(...);
#endif

Runtime dispatch alternative on ARM Linux

#include <sys/auxv.h>
#include <asm/hwcap.h>

void attention_decode_auto(...)
{
    unsigned long hwcap = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);

    if (hwcap2 & HWCAP2_SVE2) {
        attention_decode_sve2(...);
    } else if (hwcap & HWCAP_ASIMD) {
        attention_decode_neon(...);
    } else {
        attention_decode_ref(...);
    }
}
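One way to keep runtime dispatch out of the hot path is to resolve the tier once and cache it. The sketch below assumes that approach; it is not CKE's dispatch layer. The names `ck_isa_tier`, `ck_tier_from_hwcap`, and `ck_detect_tier` are hypothetical, and the capability masks are passed in as parameters so the decision logic can be unit-tested on any host. On anything other than AArch64 Linux it simply reports the reference tier.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>    /* getauxval, AT_HWCAP, AT_HWCAP2 */
#include <asm/hwcap.h>   /* HWCAP_ASIMD, HWCAP2_SVE2 */
#endif

typedef enum {
    CK_TIER_REF  = 0,    /* scalar reference kernels */
    CK_TIER_NEON = 1,    /* Advanced SIMD */
    CK_TIER_SVE2 = 2
} ck_isa_tier;

/* Pure decision function: highest tier whose capability bit is set. */
ck_isa_tier ck_tier_from_hwcap(unsigned long hwcap, unsigned long hwcap2,
                               unsigned long asimd_mask,
                               unsigned long sve2_mask)
{
    if (hwcap2 & sve2_mask) return CK_TIER_SVE2;
    if (hwcap & asimd_mask) return CK_TIER_NEON;
    return CK_TIER_REF;
}

/* Query the OS once and cache the result.
 * Not thread-safe as written; call it during single-threaded start-up. */
ck_isa_tier ck_detect_tier(void)
{
    static int cached = -1;
    if (cached < 0) {
#if defined(__linux__) && defined(__aarch64__) && \
    defined(HWCAP_ASIMD) && defined(HWCAP2_SVE2)
        cached = (int)ck_tier_from_hwcap(getauxval(AT_HWCAP),
                                         getauxval(AT_HWCAP2),
                                         HWCAP_ASIMD, HWCAP2_SVE2);
#else
        cached = (int)CK_TIER_REF;   /* no ARM hwcaps on this platform */
#endif
    }
    return (ck_isa_tier)cached;
}
```

An orchestrator could call `ck_detect_tier()` at start-up and bind kernel function pointers from the result, so the per-token path carries an indirect call rather than a capability check.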

Why compile-time binaries are good for vendor bring-up

Target A: armv8-a + NEON
  - deterministic codegen
  - zero runtime ISA branch cost
  - easy perf counter attribution

Target B: armv9-a + SVE2 + bf16
  - deterministic codegen
  - direct apples-to-apples ISA comparison
  - simpler regression tracking across compiler versions

Approach	Strength	Weakness
Compile-time dispatch	Deterministic and zero-overhead in the hot path.	Requires multiple binaries for multiple ISA tiers.
Runtime dispatch	One binary can span more machines.	Adds branch surface and benchmarking complexity.

For a silicon team, compile-time dispatch is usually a feature, not a limitation. It makes the mapping between compiler flags, intrinsics, and perf counters explicit. That is especially true when evaluating a new ISA tier like SVE2: one optimized binary per tier is exactly what the bring-up workflow wants.

Memory Bandwidth: Why ARM Server CPUs Matter for GEMV

Once the math is vectorized, decode becomes a bandwidth story. Batch-1 LLM inference is dominated by repeatedly streaming weights through GEMV-like kernels. That puts the workload on the sloped, memory-bound side of the roofline model, where attainable throughput scales with bytes moved rather than with peak FLOPs. Wider SIMD helps, but only until memory traffic becomes the real limiter.

This is why the AGI CPU specification matters so much. Twelve DDR5 channels at 8800 MT/s and CXL memory expansion are not generic datacenter bragging points. They are exactly the system resources a CPU-first quantized inference engine can actually monetize.

Representation	Approximate bytes for a 7B model	Bandwidth consequence
FP16	≈14 GB	Every decode step pulls a very large working set.
Q8_0	≈7 GB plus metadata	Half the traffic of FP16, same decode semantics.
Q4_K	≈3.5–4 GB plus metadata	Further cuts bytes moved per token.

Back-of-envelope decode bandwidth math

Assume:
  7B parameters
  Q8_0 storage ~= 1 byte per weight
  Weight traffic per token ~= 7 GB

If memory bandwidth ~= 500 GB/s:
  7 GB / 500 GB/s = 0.014 s
  => about 14 ms just to stream the weights once

That is why GEMV decode is memory-bound.
The compute can be fast and still wait on bytes.
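The arithmetic above is easy to turn into a small calculator. The sketch below is illustrative only: the function names are hypothetical, it assumes every weight byte is streamed exactly once per token and ignores KV-cache and activation traffic, and the peak-bandwidth helper assumes a 64-bit (8-byte) data path per DDR channel.

```c
/* Lower bound on per-token latency (ms) if all weights stream once. */
double ck_min_token_ms(double weight_bytes, double bandwidth_bytes_per_s)
{
    return 1000.0 * weight_bytes / bandwidth_bytes_per_s;
}

/* Matching upper bound on single-stream tokens per second. */
double ck_max_tokens_per_s(double weight_bytes, double bandwidth_bytes_per_s)
{
    return bandwidth_bytes_per_s / weight_bytes;
}

/* Theoretical peak DRAM bandwidth in bytes/s.
 * Assumption: each channel moves 8 bytes per transfer. */
double ck_peak_dram_bytes_per_s(int channels, double mega_transfers_per_s)
{
    return (double)channels * mega_transfers_per_s * 1.0e6 * 8.0;
}
```

With the post's numbers, `ck_min_token_ms(7e9, 500e9)` gives 14 ms for Q8_0, `ck_min_token_ms(14e9, 500e9)` gives 28 ms for FP16, and `ck_min_token_ms(3.5e9, 500e9)` gives 7 ms for the low end of Q4_K. Under the 8-byte-per-channel assumption, `ck_peak_dram_bytes_per_s(12, 8800)` works out to about 845 GB/s of theoretical peak, so a sustained figure of roughly 500 GB/s is a conservative planning number.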

What quantization buys in the bandwidth regime

FP16 decode load : ~14 GB
Q8_0 decode load :  ~7 GB
Q4_K decode load : ~3.5-4 GB

If the bottleneck is memory traffic,
then smaller quantized weights are not just cheaper storage.
They are direct latency leverage.

Porting roadmap showing effort levels for NEON and SVE2 across CKE kernel families.

This is where the scaling page thesis meets silicon reality. If the problem is moving model weights, then an ARM server CPU with many DDR channels and huge memory capacity starts to look like the natural home for large-model single-stream inference. With 12 DDR5 channels delivering roughly 500+ GB/s of aggregate bandwidth, the ARM server memory system lines up with what quantized GEMV actually needs.

Seen through the Theory of Constraints, this is exactly why the CPU-first argument is not anti-accelerator ideology. It is systems arithmetic. Once the model is memory-dominant and deployment is network-bounded, high-bandwidth, high-capacity CPU servers become practical inference machines.

From NEON to Production

The path from today’s three ARM kernels to a broadly ARM-optimized CKE is straightforward. It is real engineering work, but it is not conceptual reinvention. The scalar references already define the math, and x86 already defines the optimization intent. ARM needs the intrinsics layer filled in.

A sensible progression is to land fp32 normalization first, then decode-critical attention, then activation and position kernels, then add a clean SVE2 tier, then exploit bfloat16 on hardware that supports it.

Step 1 — land NEON RMSNorm and LayerNorm

#if defined(__ARM_NEON) || defined(__aarch64__)
    rmsnorm_forward_neon(input, gamma, output, D, eps);
    layernorm_forward_neon(input, gamma, beta, output, D, eps);
#else
    rmsnorm_forward_ref(input, gamma, output, D, eps);
#endif

Step 2 — port attention decode to NEON

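/* One dot(q, k_t) per cached position t.
   Assumes head_dim is a multiple of 4; otherwise a scalar tail is needed. */
for (int t = 0; t < seq_len; ++t) {
    float32x4_t score_v = vdupq_n_f32(0.0f);   /* four partial sums */
    for (int d = 0; d + 4 <= head_dim; d += 4) {
        float32x4_t qv = vld1q_f32(q + d);
        float32x4_t kv = vld1q_f32(k_cache + t * head_dim + d);
        score_v = vfmaq_f32(score_v, qv, kv);  /* score_v += qv * kv */
    }
    scores[t] = vaddvq_f32(score_v);           /* horizontal add (AArch64) */
}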
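The loop above steps four floats at a time, so any `head_dim` that is not a multiple of 4 needs a scalar tail. Here is a minimal, self-contained sketch of the same dot product with that tail added and a scalar fallback for non-AArch64 builds. The name `ck_dot_f32_sketch` is hypothetical and this is not the CKE attention kernel, just the inner product it would rest on.

```c
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* fp32 dot product: NEON main loop on AArch64, scalar tail everywhere. */
float ck_dot_f32_sketch(const float *a, const float *b, int n)
{
    int i = 0;
    float sum = 0.0f;

#if defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);       /* four partial sums */
    for (; i + 4 <= n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    sum = vaddvq_f32(acc);                     /* horizontal add */
#endif

    /* Scalar tail: the last n % 4 elements on AArch64,
     * or the whole vector when NEON is not available. */
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the vector path sums in a different order than a plain scalar loop, results can differ in the last few bits on general inputs, which is why a tolerance-based parity check is the appropriate comparison.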

Step 3 — add an SVE2 tier alongside NEON

#if defined(__ARM_FEATURE_SVE2)
    vec_dot_q8_0_q8_0_sve2(n, s, vx, vy);
#elif defined(__ARM_NEON) || defined(__aarch64__)
    vec_dot_q8_0_q8_0_neon(n, s, vx, vy);
#else
    vec_dot_q8_0_q8_0_ref(n, s, vx, vy);
#endif

Step 4 — exploit bfloat16 where the hardware earns it

#if defined(__ARM_FEATURE_SVE2) && defined(__ARM_FEATURE_BF16)
    gemm_bf16_sve2(y, W, x, M, N, K);
#else
    gemm_fp32_ref(y, W, x, M, N, K);
#endif

The right way to describe the remaining work is “porting and profiling,” not “discovering whether CPU inference is possible.” That discovery phase is already over. CKE’s scalar references and parity discipline mean each new ARM tier can be validated back to a known-good baseline with target error envelopes such as <1e-5 where appropriate.
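A parity check of this kind can be very small. The sketch below assumes an absolute, element-wise tolerance and uses a hypothetical helper name, `ck_parity_ok`; CKE's actual test harness may use a different metric, such as relative error.

```c
/* Returns 1 if every element of `out` is within `tol` of `ref`, else 0.
 * A NaN in either array fails the check. */
int ck_parity_ok(const float *ref, const float *out, int n, float tol)
{
    for (int i = 0; i < n; ++i) {
        float diff = ref[i] - out[i];
        if (diff < 0.0f) {
            diff = -diff;                      /* absolute difference */
        }
        if (!(diff <= tol)) {                  /* written so NaN is rejected */
            return 0;
        }
    }
    return 1;
}
```

In a bring-up flow, `ref` would come from the scalar reference kernel and `out` from the new NEON or SVE2 tier, with `tol` set to the envelope chosen for that kernel, for example `1e-5f`.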

The last step is empirical: run the new tiers on real hardware. Graviton4, Neoverse-based systems, or a future AGI CPU platform are where compiler codegen, cache behavior, and bandwidth realities become measurable instead of speculative.

The Bigger Picture: CPU-First Across ISAs

The ARM story sits inside a larger architectural pattern. CKE already treats ISA specialization as a first-class design axis: SSE2, AVX, AVX2+FMA, AVX-512, and now NEON are all visible tiers in the codebase. Future tiers such as SVE2, AMX, and RISC-V Vector fit naturally into the same contract.

That contract is why the project is interesting to silicon teams. The question is not merely whether a demo runs. The question is whether the repository makes the mapping from model math to instruction set explicit, correct, and reviewable. CKE increasingly does.

For readers evaluating the larger bet, the companion CKE scaling note is the broader systems argument behind this post. This NEON article is one concrete slice of that thesis: target instruction sets directly, measure the bottleneck, move the hot kernel, and keep the model inside a memory system that can actually serve it.

ISA tier	Current CKE status	Strategic meaning
SSE2 / SSSE3	Present in older x86 fallback kernels	Baseline x86 portability tier.
AVX	Present in many fp32 and quantized kernels	First 256-bit x86 widening tier.
AVX2+FMA	Heavy production use	Current mainstream x86 CPU AI tier.
AVX-512 / VNNI / AMX	Selective advanced kernels	Explicit mapping to newer Intel server/client parts.
NEON	3 quantized kernels today	Real ARM proof point already shipping.
SVE2	Feature-detected, not yet broadly implemented	Natural ARM server expansion tier.
RVV	Mentioned in feature detection and roadmap	Future CPU-first portability story beyond ARM/x86.

Kernel contract visible at file level

/**
 * CK-ENGINE KERNEL RULES:
 * =======================
 * 1. NO malloc/free - memory via bump allocator, pointers passed in
 * 2. NO OpenMP - parallelization at orchestrator/codegen layer
 * 3. API must define: inputs, outputs, workspace, and memory layouts
 * 4. Pure computation - deterministic, no side effects
 */
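Rule 1 is easier to picture with a concrete allocator. The following is a minimal bump-allocator sketch under stated assumptions: the type `ck_bump_arena` and the functions `ck_bump_init`, `ck_bump_alloc`, and `ck_bump_reset` are hypothetical names, not CKE's API, and `align` is assumed to be a power of two. The caller owns the backing buffer, so kernels only ever receive pointers.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;       /* caller-provided backing buffer */
    size_t   capacity;   /* total bytes available */
    size_t   offset;     /* bytes handed out so far, including padding */
} ck_bump_arena;

void ck_bump_init(ck_bump_arena *a, void *buffer, size_t capacity)
{
    a->base = (uint8_t *)buffer;
    a->capacity = capacity;
    a->offset = 0;
}

/* Returns an aligned pointer, or NULL if the arena is exhausted.
 * `align` must be a power of two (assumption, not checked). */
void *ck_bump_alloc(ck_bump_arena *a, size_t size, size_t align)
{
    uintptr_t cur = (uintptr_t)a->base + a->offset;
    uintptr_t aligned = (cur + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t padding = (size_t)(aligned - cur);
    size_t remaining = a->capacity - a->offset;

    if (padding > remaining || size > remaining - padding) {
        return NULL;
    }
    a->offset += padding + size;
    return (void *)aligned;
}

/* Release everything at once, e.g. between decode steps. */
void ck_bump_reset(ck_bump_arena *a)
{
    a->offset = 0;
}
```

The appeal for kernels is that allocation is a pointer bump with no free list, and workspace lifetime is controlled by the orchestrator through a single reset.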

How CKE scales model support without rewriting everything

template
  -> IR / kernel contract
  -> code generation
  -> specialized C kernel families
  -> ISA-specific intrinsics where profitable
  -> scalar fallback where parity must dominate

Future ISA tiers fit the same architecture

Current:  SSE2 / AVX / AVX2+FMA / AVX-512 / NEON
Next:     SVE2 / AMX / RISC-V Vector
Constant: same algorithm, same kernel contract, different execution width

Repository shape is itself an engineering signal. Eighty-three kernel source files and forty-nine x86-SIMD files tell a silicon reviewer that CKE is treating kernels as a real substrate, not as a thin wrapper around generic libraries. The ARM opportunity is to raise the NEON/SVE2 share of that inventory.

That is also why the x86 SIMD deep dive matters as a companion. It established that CKE already reasons correctly about width, reduction ergonomics, and fused instructions. This ARM NEON walkthrough shows that the same engineering habit already exists on ARM, even if ARM coverage is earlier in its build-out.

Conclusion: An Open Invitation

CKE runs on ARM today. Not as a promise, not as a future branch, and not as a marketing abstraction. It ran Qwen2 and Qwen3 on real Cortex-A72 hardware, and the quantized inner loops discussed above contain real NEON intrinsics.

That is the invitation to ARM’s silicon team. The ISA mapping is already explicit. The algorithmic substrate is already correct. More than eighty kernels are still open for NEON and SVE2 ports, but that is exactly the kind of mechanical, high-leverage work that turns an existence proof into a flagship platform story.

ARM AGI CPU with 136 Neoverse V3 cores, SVE2, bfloat16, 12-channel DDR5, and CXL-backed multi-terabyte memory is not a stretch goal for CKE. It is the natural next environment in which to prove the full CPU-first thesis at server scale.

CKE GitHub: https://github.com/antshiv/C-Kernel-Engine · Docs: https://c-kernel-engine.github.io/C-Kernel-Engine/ · Scaling thesis: https://c-kernel-engine.github.io/C-Kernel-Engine/scaling.html · YouTube: youtube.com/@antshivrobotics

Closing summary for ARM reviewers

Already true today:
  - Qwen2 and Qwen3 ran on TI TDA4VM
  - CKE ships NEON kernels for Q8_0×Q8_0, Q5_0×Q8_0, Q6_K×Q8_K
  - NEON is active in public dispatch for Q8_0 and Q5_0
  - Q6_K NEON path is compiled and benchmarkable

Natural next step:
  - Port normalization, attention, RoPE, softmax, and activations
  - Add SVE2 tier
  - Add bf16 kernels
  - Validate on Neoverse-class silicon
