SIMD Deep Dive: How AI Kernels Use SSE, AVX, AVX-512, and VNNI

ML fundamentals · SIMD deep dive

This ShivasNotes deep dive is written for CPU silicon teams evaluating kernel-level software competence. The subject is not “can the model run?” but “how explicitly does the software map ML math onto the ISA?” Every code block below comes from the real C-Kernel-Engine production tree, and the companion walkthrough lives at youtube.com/@antshivrobotics.

The thesis of this post is simple: the same math, wider registers, fewer instructions. C-Kernel-Engine keeps the algorithm constant and changes only the execution width, reduction strategy, and fused instructions available on the host CPU.

For a CPU vendor, this is the signal that matters: the code does not hand-wave “SIMD acceleration.” It names the exact intrinsics, the exact fallback path, and the exact parity escape hatch.

What this post covers

The first part frames why CPU inference becomes practical only when the kernel layer is ISA-aware. The middle builds the SIMD and dispatch model, then walks real kernels: RMSNorm, AdamW, DeltaNet, and quantized GEMV across SSE2, AVX, AVX2+FMA, AVX-512, and VNNI.

The final part switches from single kernels to system concerns: bandwidth, parity, repository organization, and what the same architecture means for ARM NEON/SVE2 and future matrix extensions.

Why SIMD Matters for AI on CPU

There is a very large difference between “CPU inference works” and “CPU inference is competitive.” At batch size 1, the model is often dominated by single-token decode latency, and on a modern desktop-class core complex that can already mean practical throughput. The difference between a disappointing CPU stack and a useful one is usually not the transformer equations. It is whether the hot kernels are written to exploit vector width, fused operations, and cache-friendly layouts.

C-Kernel-Engine currently ships 83 kernel .c files in src/kernels, with 49 of them containing explicit SIMD intrinsic paths. That is the software answer to the old “CPU is slow for AI” refrain. The scalar math is only the baseline. The production story is SSE2 moving 4 FP32 values, AVX moving 8, AVX2+FMA keeping 8 lanes but collapsing multiply-plus-add pairs, and AVX-512 doubling width again to 16 FP32 lanes while also improving reductions.

That widening is why a CPU-only runtime can move from “demo” to “usable.” The algorithm for RMSNorm, AdamW, or a quantized dot product does not change between tiers. What changes is the amount of useful work completed per instruction retirement.

100 tok/s Single-stream CPU inference can already be practical on strong x86 client silicon. SIMD is what turns that possibility into a sustained kernel-level implementation strategy.

Overview of scalar, SSE2, AVX, AVX2 plus FMA, and AVX-512 throughput tiers for AI kernels.

C-Kernel-Engine kernel rules header — the contract every hot loop follows

/**
 * @file rmsnorm_kernels.c
 * @brief RMSNorm forward/backward kernels with SIMD (SSE/AVX/AVX512)
 *
 * CK-ENGINE KERNEL RULES:
 * =======================
 * 1. NO malloc/free - memory via bump allocator, pointers passed in
 * 2. NO OpenMP - parallelization at orchestrator/codegen layer
 * 3. API must define: inputs, outputs, workspace, and memory layouts
 * 4. Pure computation - deterministic, no side effects
 *
 * After changes: make test && make llamacpp-parity-full
 *
 * RMSNorm: y[i] = gamma[i] * x[i] / sqrt(mean(x^2) + eps)
 */

The header tells you almost everything about the design philosophy. No malloc. No OpenMP inside kernels. Pure computation only. That matters because SIMD optimization only stays auditable when kernel scope is narrow. If orchestration, allocation, and threading policy are mixed into the same file, ISA-level reasoning becomes noisy very quickly.

What Is SIMD? Single Instruction, Multiple Data

SIMD means one instruction applies the same operation to multiple independent lanes at once. On x86 FP32 code, the mental model is straightforward: a 128-bit register can hold 4 floats, a 256-bit register can hold 8, and a 512-bit register can hold 16. The instruction semantics stay uniform across lanes; only the lane count changes.

\text{FP32 lanes} = \frac{\text{register width in bits}}{32}\quad\Rightarrow\quad128\text{-bit}=4,\;256\text{-bit}=8,\;512\text{-bit}=16

ISA tier	Register type	Width	FP32 lanes	Typical intrinsic prefix
SSE2	`__m128`	128-bit	4	`_mm_*`
AVX	`__m256`	256-bit	8	`_mm256_*`
AVX2+FMA	`__m256`	256-bit	8	`_mm256_fmadd_*`
AVX-512	`__m512`	512-bit	16	`_mm512_*`

text

SSE2   : [x0 x1 x2 x3]          -> 4 FP32 lanes
AVX    : [x0 x1 x2 x3 x4 x5 x6 x7] -> 8 FP32 lanes
AVX-512: [x0 ... x15]               -> 16 FP32 lanes

Operation example:
  y = a * b + c
Each lane performs the same arithmetic independently.

This is why ML kernels map so cleanly onto SIMD. RMSNorm, MLPs, attention projections, optimizer updates, and dequantized dot products are full of elementwise operations and short reduction trees. SIMD is not a special trick for AI; it is the native execution surface for dense numeric code.

Think of SIMD as horizontal replication of the same scalar kernel body. The software engineer’s task is to preserve the math while widening the body as far as the ISA permits.

The x86 SIMD Evolution

x86 SIMD evolution is not one story but four distinct software eras. SSE2 established the 128-bit baseline. AVX doubled width but left many reductions and multiply-add patterns relatively clumsy. AVX2 plus FMA kept the same 256-bit width yet materially changed kernel shape because a*b+c stopped costing two separate arithmetic instructions. AVX-512 doubled width again and made reductions dramatically cleaner.

ISA	Representative launch	Width	FP32 lanes	What changed for ML kernels
SSE2	2001 · Pentium 4	128-bit	4	First universal x86 FP32 vector baseline.
AVX	2011 · Sandy Bridge	256-bit	8	Double width, but dot-product style code still needs separate mul and add.
AVX2+FMA	2013 · Haswell	256-bit	8	Integer widening plus fused multiply-add for accumulation-heavy kernels.
AVX-512	2017 · Skylake-X	512-bit	16	Wider vectors, built-in reductions, richer byte/int data movement, VNNI on later parts.
ARM NEON / SVE2	Ongoing	128-bit / scalable	4 / VLA	Different syntax, same thesis: keep the math, change the vectorization surface.

Timeline of SSE2, AVX, AVX2 plus FMA, AVX-512, and future ARM SVE2 and AMX style extensions.

Feature detection in include/ck_features.h — x86 tier macros

/* Intel/x86 */
#if defined(__AMX__)
    #define CK_HAS_AMX 1
#endif

#if defined(__AVX512F__) && defined(__AVX512BW__) && defined(__AVX512DQ__)
    #define CK_HAS_AVX512 1
#endif

#if defined(__AVX2__) && defined(__FMA__)
    #define CK_HAS_AVX2_FMA 1
#endif

#if defined(__AVX__)
    #define CK_HAS_AVX 1
#endif

Feature detection in include/ck_features.h — ARM hooks already exist

/* ARM */
#if defined(__aarch64__)
    #if defined(__ARM_FEATURE_SVE2)
        #define CK_HAS_SVE2 1
    #endif
    #if defined(__ARM_FEATURE_NEON)
        #define CK_HAS_NEON 1
    #endif
#endif

This header makes the repo’s hardware model explicit: compile-time capability macros first, dispatch macros second. It already names AMX, AVX-512, AVX2+FMA, AVX, NEON, SVE2, VNNI, and RVV. That is the kind of architecture CPU vendors want to see because the optimization surface is described in the codebase itself, not hidden inside build folklore.

CKE's Dispatch Pattern

C-Kernel-Engine dispatch is deliberately boring. That is a compliment. The pattern is always some variation of #if defined(__AVX512F__), then AVX2 or AVX, then scalar fallback, with ck_strict_parity_enabled() taking precedence whenever exactness matters more than speed. Same API. Same algorithm. Different parallelism.

RMSNorm dispatch gate — lines 140-144

    const char *exact_env = getenv("CK_RMSNORM_EXACT");
    if (ck_strict_parity_enabled() || (exact_env && atoi(exact_env) != 0)) {
        rmsnorm_forward_strict_scalar(input, gamma, output, rstd_cache, T, D, aligned, eps);
        return;
    }

DeltaNet dispatch and implementation name reporting

static int ck_deltanet_force_ref(void)
{
    const char *env = getenv("CK_DELTANET_FORCE_REF");
    return env && atoi(env) != 0;
}

const char *gated_deltanet_impl_name(void)
{
    if (ck_strict_parity_enabled() || ck_deltanet_force_ref()) {
        return "REF";
    }
#if defined(__AVX512F__)
    return "AVX512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#else
    return "REF";
#endif
}

void gated_deltanet_autoregressive_forward(const float *q,
                                           const float *k,
                                           const float *v,
                                           const float *g,
                                           const float *beta,
                                           const float *state_in,
                                           float *state_out,
                                           float *out,
                                           int num_heads,
                                           int state_dim,
                                           float norm_eps)
{
    if (!q || !k || !v || !g || !beta || !state_in || !state_out || !out) {
        return;
    }
    if (num_heads <= 0 || state_dim <= 0) {
        return;
    }

    /*
     * q and k arrive pre-normalized by recurrent_qk_l2_norm, so the
     * ISA-specialized kernels can follow the same contract as the scalar ref.
     */
    if (ck_strict_parity_enabled() || ck_deltanet_force_ref()) {
        gated_deltanet_autoregressive_forward_ref(
            q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
        return;
    }
#if defined(__AVX512F__)
    gated_deltanet_autoregressive_forward_avx512(
        q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
#elif defined(__AVX2__)
    gated_deltanet_autoregressive_forward_avx2(
        q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
#elif defined(__AVX__)
    gated_deltanet_autoregressive_forward_avx(
        q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
#else
    gated_deltanet_autoregressive_forward_ref(
        q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
#endif

There are two important details here. First, strict parity is a first-class override, not a debug afterthought. Second, environment variables such as CK_RMSNORM_EXACT and CK_DELTANET_FORCE_REF make it easy to pin a kernel to the reference path during validation.

For silicon bring-up and kernel validation, the best dispatch model is the one you can turn off on command.

Best-vector and quantized dispatch macros in ck_features.h

/* Best available vector width */
#if defined(CK_HAS_AMX)
    #define CK_VECTOR_WIDTH 512
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX512)
    #define CK_VECTOR_WIDTH 512
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX2_FMA)
    #define CK_VECTOR_WIDTH 256
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX)
    #define CK_VECTOR_WIDTH 256
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_NEON)
    #define CK_VECTOR_WIDTH 128
    #define CK_HAS_BEST_VECTOR 1
#else
    #define CK_VECTOR_WIDTH 32  /* Scalar fallback */
    #define CK_HAS_BEST_VECTOR 0
#endif

#if defined(CK_HAS_AMX)
    #define CK_GEMM_DISPATCH(...) gemm_amx(__VA_ARGS__)
#elif defined(CK_HAS_AVX512)
    #define CK_GEMM_DISPATCH(...) gemm_avx512(__VA_ARGS__)
#elif defined(CK_HAS_AVX2_FMA)
    #define CK_GEMM_DISPATCH(...) gemm_avx2(__VA_ARGS__)
#elif defined(CK_HAS_AVX)
    #define CK_GEMM_DISPATCH(...) gemm_avx(__VA_ARGS__)
#else
    #define CK_GEMM_DISPATCH(...) gemm_ref(__VA_ARGS__)
#endif

/**
 * @brief Dispatch to best available GEMV kernel
 */
#if defined(CK_HAS_AMX)
    #define CK_GEMV_DISPATCH(...) gemv_amx(__VA_ARGS__)
#elif defined(CK_HAS_AVX512)
    #define CK_GEMV_DISPATCH(...) gemv_avx512(__VA_ARGS__)
#elif defined(CK_HAS_AVX2_FMA)
    #define CK_GEMV_DISPATCH(...) gemv_avx2(__VA_ARGS__)
#elif defined(CK_HAS_AVX)
    #define CK_GEMV_DISPATCH(...) gemv_avx(__VA_ARGS__)
#else
    #define CK_GEMV_DISPATCH(...) gemv_ref(__VA_ARGS__)
#endif

/**
 * @brief Dispatch to best available quantized GEMV kernel
 * For INT8/INT4 quantization with VNNI/AMX acceleration
 */
#if defined(CK_HAS_AMX)
    #define CK_QGEMM_DISPATCH(...) qgemm_amx(__VA_ARGS__)
#elif defined(CK_HAS_AVX512VNNI)
    #define CK_QGEMM_DISPATCH(...) qgemm_avx512vnni(__VA_ARGS__)
#elif defined(CK_HAS_VNNI)
    #define CK_QGEMM_DISPATCH(...) qgemm_vnni(__VA_ARGS__)
#elif defined(CK_HAS_AVX2_FMA)
    #define CK_QGEMM_DISPATCH(...) qgemm_avx2(__VA_ARGS__)
#else
    #define CK_QGEMM_DISPATCH(...) qgemm_ref(__VA_ARGS__)

Why this matters to an ISA reviewer

The dispatch code is not pretending that wider hardware automatically preserves scalar behavior. It explicitly encodes the trade: fast path when allowed, reference path when parity is required.

That makes every optimization auditable. You can reason about correctness per tier because the boundary between tiers is an ordinary function call, not a hidden JIT rewrite.

RMSNorm Forward: Scalar → AVX → AVX-512

The earlier LayerNorm and RMSNorm post covered the normalization math. Here the interesting part is what the current production kernel does with that math. In the present C-Kernel-Engine tree, rmsnorm_forward() carries a strict scalar path, an AVX path, and an AVX-512 path in one function. SSE2 does not appear in this specific kernel; the SSE2 tier shows up clearly in AdamW below. That honesty is part of the point: kernel coverage is per operation, not marketing copy.

\mu_2=\operatorname{mean}(x^2)=\frac{1}{D}\sum_{i=0}^{D-1}x_i^2

y_i=\gamma_i\frac{x_i}{\sqrt{\mu_2+\epsilon}}

Strict scalar RMSNorm reference

static void rmsnorm_forward_strict_scalar(const float *input,
                                          const float *gamma,
                                          float *output,
                                          float *rstd_cache,
                                          int tokens,
                                          int d_model,
                                          int aligned_embed_dim,
                                          float eps)
{
    const float inv_d = 1.0f / (float)d_model;
    for (int t = 0; t < tokens; ++t) {
        const float *x = input + (size_t)t * (size_t)aligned_embed_dim;
        float *y = output + (size_t)t * (size_t)aligned_embed_dim;

        float sum_sq = 0.0f;
        for (int d = 0; d < d_model; ++d) {
            const float v = x[d];
            sum_sq += v * v;
        }
        const float mean_sq = sum_sq * inv_d;
        const float rstd = 1.0f / sqrtf(mean_sq + eps);
        if (rstd_cache) {
            rstd_cache[t] = rstd;
        }

        for (int d = 0; d < d_model; ++d) {
            const float x_hat = x[d] * rstd;
            y[d] = x_hat * gamma[d];
        }
        for (int d = d_model; d < aligned_embed_dim; ++d) {
            y[d] = 0.0f;
        }
    }
}

The scalar reference is the truth source: accumulate sum_sq, form rstd, then scale by gamma. Nothing fancy. This is the path that makes parity testing meaningful because every vectorized path can be compared against it.

2 passes RMSNorm does one reduction pass for the mean square and one elementwise pass for normalization and scale.

AVX1 helper because 256-bit horizontal reduction is not native

/* AVX1 horizontal sum helper (no _mm256_reduce_add_ps in AVX1) */
#if defined(__AVX__) && !defined(__AVX512F__)
static inline float hsum256_ps_rmsnorm(__m256 v) {
    // Sum upper and lower 128-bit lanes
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 sum128 = _mm_add_ps(lo, hi);
    // Horizontal add within 128-bit lane
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    return _mm_cvtss_f32(sum128);
}

AVX RMSNorm path — 8 floats per iteration, no FMA

#elif defined(__AVX__)
        // AVX: Process 8 floats at a time
        __m256 sum_sq_vec = _mm256_setzero_ps();
        int d = 0;

        // Vectorized sum of squares (no FMA in AVX1, use mul + add)
        for (; d + 8 <= D; d += 8) {
            __m256 xv = _mm256_loadu_ps(&x[d]);
            __m256 xv_sq = _mm256_mul_ps(xv, xv);
            sum_sq_vec = _mm256_add_ps(sum_sq_vec, xv_sq);
        }
        float sum_sq = hsum256_ps_rmsnorm(sum_sq_vec);

        // Handle remaining elements
        for (; d < D; ++d) {
            sum_sq += x[d] * x[d];
        }

        float mean_sq = sum_sq / (float)D;
        float rstd = 1.0f / sqrtf(mean_sq + eps);
        if (rstd_cache) {
            rstd_cache[t] = rstd;
        }

        // Apply normalization and scale (vectorized)
        __m256 rstd_vec = _mm256_set1_ps(rstd);
        d = 0;
        for (; d + 8 <= D; d += 8) {
            __m256 xv = _mm256_loadu_ps(&x[d]);
            __m256 gv = _mm256_loadu_ps(&gamma[d]);
            __m256 x_hat = _mm256_mul_ps(xv, rstd_vec);
            __m256 yv = _mm256_mul_ps(x_hat, gv);
            _mm256_storeu_ps(&y[d], yv);
        }
        // Handle remaining elements
        for (; d < D; ++d) {
            y[d] = x[d] * rstd * gamma[d];
        }

AVX-512 RMSNorm path — 16 floats per iteration with fused accumulation

#if defined(__AVX512F__)
        // AVX-512: Process 16 floats at a time
        __m512 sum_sq_vec = _mm512_setzero_ps();
        int d = 0;

        // Vectorized sum of squares
        for (; d + 16 <= D; d += 16) {
            __m512 xv = _mm512_loadu_ps(&x[d]);
            sum_sq_vec = _mm512_fmadd_ps(xv, xv, sum_sq_vec);
        }
        float sum_sq = _mm512_reduce_add_ps(sum_sq_vec);

        // Handle remaining elements
        for (; d < D; ++d) {
            sum_sq += x[d] * x[d];
        }

        float mean_sq = sum_sq / (float)D;
        float rstd = 1.0f / sqrtf(mean_sq + eps);
        if (rstd_cache) {
            rstd_cache[t] = rstd;
        }

        // Apply normalization and scale (vectorized)
        __m512 rstd_vec = _mm512_set1_ps(rstd);
        d = 0;
        for (; d + 16 <= D; d += 16) {
            __m512 xv = _mm512_loadu_ps(&x[d]);
            __m512 gv = _mm512_loadu_ps(&gamma[d]);
            __m512 x_hat = _mm512_mul_ps(xv, rstd_vec);
            __m512 yv = _mm512_mul_ps(x_hat, gv);
            _mm512_storeu_ps(&y[d], yv);
        }
        // Handle remaining elements
        for (; d < D; ++d) {
            y[d] = x[d] * rstd * gamma[d];
        }

RMSNorm forward comparing scalar, AVX, and AVX-512 instruction counts and reduction behavior.

The progression is the whole SIMD story in miniature. AVX1 computes x*x and then adds it into the accumulator with two separate operations. AVX-512 replaces that pair with _mm512_fmadd_ps(xv, xv, sum_sq_vec). AVX1 also needs a bespoke helper to collapse an __m256 into one scalar. AVX-512 gets _mm512_reduce_add_ps directly. Same RMSNorm equation. Fewer instructions. Cleaner reduction.

This section is also a good reminder that SIMD maturity is not only about width. AVX and AVX2 share the same 256-bit register size, but once FMA appears, the cost model for accumulation-heavy kernels changes materially even before AVX-512 enters the picture.

AVX-512 is not just “AVX but bigger.” The reduction ergonomics are radically better for normalization kernels.

AdamW Optimizer: SSE2 / AVX / AVX-512

This is the kernel that actually mutates model weights. The AdamW discussion from Optimizers: From SGD to AdamW is where instruction selection becomes impossible to ignore because almost every line is a weighted sum, a square, a bias correction, or a final fused update.

m_t=\beta_1m_{t-1}+(1-\beta_1)g_t

v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2

w_t=w_{t-1}-\eta\left(\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}+\lambda w_{t-1}\right)

AdamW algorithm comment and long-horizon fp32 warning

 * AdamW Algorithm:
 *   m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
 *   v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
 *   m_hat = m_t / (1 - beta1^t)
 *   v_hat = v_t / (1 - beta2^t)
 *   w_t = w_{t-1} - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * w_{t-1})
 *
 * Note: AdamW applies weight decay directly to weights, not to gradients.
 * This is different from L2 regularization (Adam with L2 adds decay to gradient).
 *
 * Epsilon amplification at early steps: at step 1 with bc2=0.001, elements where
 * v≈0 (sparse/near-zero gradients) produce sqrt(v_hat)+eps ≈ eps=1e-8, amplifying
 * any fp32 rounding in the accumulated gradient by up to lr/eps = 1e6. This is
 * expected AdamW behavior, not a bug. In parity tests, gate on mean_param_diff
 * (not max_param_diff) for grad_accum > 1 to avoid false alarms from these outliers.
 *
 * Long-horizon fp32 risk: at lr=1e-3 with all-fp32 SIMD paths, known drift begins
 * around step ~800 due to accumulated rounding. Use ck_strict_parity_enabled() (fp64
 * path) for parity validation; for production training keep lr < 1e-3 or monitor.
 */

Strict parity AdamW path — fp64 scalar reference

    if (ck_strict_parity_enabled()) {
        const double beta1_d = (double)beta1;
        const double beta2_d = (double)beta2;
        const double one_minus_beta1_d = 1.0 - beta1_d;
        const double one_minus_beta2_d = 1.0 - beta2_d;
        const double lr_d = (double)lr;
        const double eps_d = (double)eps;
        const double wd_d = (double)weight_decay;

        const double bc1 = 1.0 - pow(beta1_d, (double)step);
        const double bc2 = 1.0 - pow(beta2_d, (double)step);
        const double step_size = lr_d / bc1;
        const double bc2_sqrt = sqrt(bc2);
        const double wd_scale = 1.0 - lr_d * wd_d;

        for (size_t i = 0; i < numel; ++i) {
            double g = (double)grad[i];
            double w = (double)weight[i];
            double m_i = (double)m[i];
            double v_i = (double)v[i];

            m_i = beta1_d * m_i + one_minus_beta1_d * g;
            v_i = beta2_d * v_i + one_minus_beta2_d * g * g;

            // Match PyTorch AdamW op order: decoupled weight decay first.
            w *= wd_scale;

            double denom = sqrt(v_i) / bc2_sqrt + eps_d;
            w -= step_size * (m_i / denom);

            m[i] = (float)m_i;
            v[i] = (float)v_i;
            weight[i] = (float)w;
        }
        return;

AVX-512 AdamW — fmadd and fnmadd in the hot loop

#if defined(__AVX512F__)
    // AVX-512 path: process 16 floats at a time
    __m512 v_beta1 = _mm512_set1_ps(beta1);
    __m512 v_beta2 = _mm512_set1_ps(beta2);
    __m512 v_one_minus_beta1 = _mm512_set1_ps(one_minus_beta1);
    __m512 v_one_minus_beta2 = _mm512_set1_ps(one_minus_beta2);
    __m512 v_lr = _mm512_set1_ps(lr);
    __m512 v_eps = _mm512_set1_ps(eps);
    __m512 v_weight_decay = _mm512_set1_ps(weight_decay);
    __m512 v_bc1_inv = _mm512_set1_ps(1.0f / bias_correction1);
    __m512 v_bc2_inv = _mm512_set1_ps(1.0f / bias_correction2);

    size_t i = 0;
    for (; i + 16 <= numel; i += 16) {
        __m512 g = _mm512_loadu_ps(&grad[i]);
        __m512 w = _mm512_loadu_ps(&weight[i]);
        __m512 m_val = _mm512_loadu_ps(&m[i]);
        __m512 v_val = _mm512_loadu_ps(&v[i]);

        // m = beta1 * m + (1 - beta1) * g
        m_val = _mm512_fmadd_ps(v_beta1, m_val, _mm512_mul_ps(v_one_minus_beta1, g));

        // v = beta2 * v + (1 - beta2) * g^2
        __m512 g_sq = _mm512_mul_ps(g, g);
        v_val = _mm512_fmadd_ps(v_beta2, v_val, _mm512_mul_ps(v_one_minus_beta2, g_sq));

        // Bias-corrected estimates
        __m512 m_hat = _mm512_mul_ps(m_val, v_bc1_inv);
        __m512 v_hat = _mm512_mul_ps(v_val, v_bc2_inv);

        // w = w - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * w)
        __m512 denom = _mm512_add_ps(_mm512_sqrt_ps(v_hat), v_eps);
        __m512 update = _mm512_div_ps(m_hat, denom);
        update = _mm512_fmadd_ps(v_weight_decay, w, update);
        w = _mm512_fnmadd_ps(v_lr, update, w);

        _mm512_storeu_ps(&weight[i], w);
        _mm512_storeu_ps(&m[i], m_val);
        _mm512_storeu_ps(&v[i], v_val);
    }

AVX AdamW — same math, explicit mul plus add and mul plus sub

#elif defined(__AVX__)
    // AVX path: process 8 floats at a time (no FMA on older CPUs like Ivy Bridge)
    __m256 v_beta1 = _mm256_set1_ps(beta1);
    __m256 v_beta2 = _mm256_set1_ps(beta2);
    __m256 v_one_minus_beta1 = _mm256_set1_ps(one_minus_beta1);
    __m256 v_one_minus_beta2 = _mm256_set1_ps(one_minus_beta2);
    __m256 v_lr = _mm256_set1_ps(lr);
    __m256 v_eps = _mm256_set1_ps(eps);
    __m256 v_weight_decay = _mm256_set1_ps(weight_decay);
    __m256 v_bc1_inv = _mm256_set1_ps(1.0f / bias_correction1);
    __m256 v_bc2_inv = _mm256_set1_ps(1.0f / bias_correction2);

    size_t i = 0;
    for (; i + 8 <= numel; i += 8) {
        __m256 g = _mm256_loadu_ps(&grad[i]);
        __m256 w = _mm256_loadu_ps(&weight[i]);
        __m256 m_val = _mm256_loadu_ps(&m[i]);
        __m256 v_val = _mm256_loadu_ps(&v[i]);

        // m = beta1 * m + (1 - beta1) * g
        m_val = _mm256_add_ps(_mm256_mul_ps(v_beta1, m_val),
                              _mm256_mul_ps(v_one_minus_beta1, g));

        // v = beta2 * v + (1 - beta2) * g^2
        __m256 g_sq = _mm256_mul_ps(g, g);
        v_val = _mm256_add_ps(_mm256_mul_ps(v_beta2, v_val),
                              _mm256_mul_ps(v_one_minus_beta2, g_sq));

        // Bias-corrected estimates
        __m256 m_hat = _mm256_mul_ps(m_val, v_bc1_inv);
        __m256 v_hat = _mm256_mul_ps(v_val, v_bc2_inv);

        // w = w - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * w)
        __m256 denom = _mm256_add_ps(_mm256_sqrt_ps(v_hat), v_eps);
        __m256 update = _mm256_div_ps(m_hat, denom);
        update = _mm256_add_ps(update, _mm256_mul_ps(v_weight_decay, w));
        w = _mm256_sub_ps(w, _mm256_mul_ps(v_lr, update));

        _mm256_storeu_ps(&weight[i], w);
        _mm256_storeu_ps(&m[i], m_val);
        _mm256_storeu_ps(&v[i], v_val);

SSE2 AdamW — 4 floats at a time

#elif defined(__SSE2__)
    // SSE2 path: process 4 floats at a time
    __m128 v_beta1 = _mm_set1_ps(beta1);
    __m128 v_beta2 = _mm_set1_ps(beta2);
    __m128 v_one_minus_beta1 = _mm_set1_ps(one_minus_beta1);
    __m128 v_one_minus_beta2 = _mm_set1_ps(one_minus_beta2);
    __m128 v_lr = _mm_set1_ps(lr);
    __m128 v_eps = _mm_set1_ps(eps);
    __m128 v_weight_decay = _mm_set1_ps(weight_decay);
    __m128 v_bc1_inv = _mm_set1_ps(1.0f / bias_correction1);
    __m128 v_bc2_inv = _mm_set1_ps(1.0f / bias_correction2);

    size_t i = 0;
    for (; i + 4 <= numel; i += 4) {
        __m128 g = _mm_loadu_ps(&grad[i]);
        __m128 w = _mm_loadu_ps(&weight[i]);
        __m128 m_val = _mm_loadu_ps(&m[i]);
        __m128 v_val = _mm_loadu_ps(&v[i]);

        // m = beta1 * m + (1 - beta1) * g
        m_val = _mm_add_ps(_mm_mul_ps(v_beta1, m_val),
                           _mm_mul_ps(v_one_minus_beta1, g));

        // v = beta2 * v + (1 - beta2) * g^2
        __m128 g_sq = _mm_mul_ps(g, g);
        v_val = _mm_add_ps(_mm_mul_ps(v_beta2, v_val),
                           _mm_mul_ps(v_one_minus_beta2, g_sq));

        // Bias-corrected estimates
        __m128 m_hat = _mm_mul_ps(m_val, v_bc1_inv);
        __m128 v_hat = _mm_mul_ps(v_val, v_bc2_inv);

        // w = w - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * w)
        __m128 denom = _mm_add_ps(_mm_sqrt_ps(v_hat), v_eps);
        __m128 update = _mm_div_ps(m_hat, denom);
        update = _mm_add_ps(update, _mm_mul_ps(v_weight_decay, w));
        w = _mm_sub_ps(w, _mm_mul_ps(v_lr, update));

        _mm_storeu_ps(&weight[i], w);
        _mm_storeu_ps(&m[i], m_val);
        _mm_storeu_ps(&v[i], v_val);

AdamW optimizer update showing instruction count reduction from SSE2 and AVX to AVX-512 with fused operations.

Operation	SSE2 / AVX	AVX-512	What changed
`m = β₁m + (1-β₁)g`	`mul + mul + add`	`fmadd + mul`	One accumulation fused.
`v = β₂v + (1-β₂)g²`	`mul + mul + add`	`fmadd + mul`	Again, one arithmetic edge disappears.
`w = w - lr * update`	`mul + sub`	`fnmadd`	Negated multiply-add becomes one instruction.

The most revealing line in the AVX-512 path is not the obvious _mm512_fmadd_ps. It is _mm512_fnmadd_ps(v_lr, update, w). That is the optimizer update written exactly as the hardware wants to execute it: fused negate, multiply, add, then one rounding.

This is why FMA matters so much for ML. Training code is full of expressions that are literally multiply-add chains wearing different variable names.

The Horizontal Reduction Problem

SIMD loves independent lanes. Kernels love reductions. That tension is one of the main reasons ISA evolution changes software shape. A dot product or sum-of-squares spends most of its life as lane-local work and then suddenly needs one scalar answer. How ugly that collapse step is depends strongly on the ISA tier.

RMSNorm AVX horizontal sum helper reused as a reduction case study

/* AVX1 horizontal sum helper (no _mm256_reduce_add_ps in AVX1) */
#if defined(__AVX__) && !defined(__AVX512F__)
static inline float hsum256_ps_rmsnorm(__m256 v) {
    // Sum upper and lower 128-bit lanes
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 sum128 = _mm_add_ps(lo, hi);
    // Horizontal add within 128-bit lane
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    return _mm_cvtss_f32(sum128);
}

Tier	Reduction story	Practical consequence
SSE2 / SSE3 style	Repeated 128-bit horizontal adds.	Short but still multi-step.
AVX1	Split 256-bit vector into two 128-bit halves, then hadd twice.	Extra extract and lane-crossing overhead.
AVX-512	`_mm512_reduce_add_ps`	Reduction becomes explicit ISA support instead of helper choreography.

Diagram comparing manual AVX horizontal reduction against native AVX-512 reduce-add.

This is why AVX-512 can feel disproportionately valuable for normalization, softmax, optimizer statistics, and quantized dot products. Doubling width matters, but removing hand-built reduction trees matters too.

1 instruction In AVX-512 the final collapse can be represented directly as a reduce-add intrinsic instead of extract plus add plus hadd plus hadd.

DeltaNet State Sweep: AVX2 Row-Pair Unroll

The Gated DeltaNet post introduced DeltaNet as a recurrent fixed-memory alternative to scanning an ever-growing KV cache. The AVX2 kernel makes the systems point beautifully: the state update is not just vectorized; it is reorganized to process two rows of the state matrix at once so that scalar broadcasts and FMA opportunities are reused across more work.

DeltaNet AVX2 fused multiply-add helper

#if defined(__AVX2__)
static inline __m256 ck_deltanet_fmadd256(__m256 a, __m256 b, __m256 c)
{
#if defined(__FMA__)
    return _mm256_fmadd_ps(a, b, c);
#else
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
#endif
}

DeltaNet AVX2 row-pair sweep — two rows of state processed together

        for (; row + 2 <= state_dim; row += 2) {
            const size_t row0_off = (size_t)row * (size_t)state_dim;
            const size_t row1_off = (size_t)(row + 1) * (size_t)state_dim;
            const __m256 k0_v = _mm256_set1_ps(k_hat[row]);
            const __m256 k1_v = _mm256_set1_ps(k_hat[row + 1]);

            col = 0;
            for (; col + 8 <= state_dim; col += 8) {
                __m256 prev0_v = _mm256_loadu_ps(state_prev + row0_off + (size_t)col);
                __m256 prev1_v = _mm256_loadu_ps(state_prev + row1_off + (size_t)col);
                __m256 cur0_v = _mm256_mul_ps(prev0_v, gate_v);
                __m256 cur1_v = _mm256_mul_ps(prev1_v, gate_v);
                __m256 kv_v = _mm256_loadu_ps(kv_mem + col);
                kv_v = ck_deltanet_fmadd256(cur0_v, k0_v, kv_v);
                kv_v = ck_deltanet_fmadd256(cur1_v, k1_v, kv_v);
                _mm256_storeu_ps(state_cur + row0_off + (size_t)col, cur0_v);
                _mm256_storeu_ps(state_cur + row1_off + (size_t)col, cur1_v);
                _mm256_storeu_ps(kv_mem + col, kv_v);
            }
            for (; col < state_dim; ++col) {
                const float cur0 = state_prev[row0_off + (size_t)col] * gate;
                const float cur1 = state_prev[row1_off + (size_t)col] * gate;
                state_cur[row0_off + (size_t)col] = cur0;
                state_cur[row1_off + (size_t)col] = cur1;
                kv_mem[col] += cur0 * k_hat[row] + cur1 * k_hat[row + 1];
            }

DeltaNet AVX2 row-pair writeback and readout

        row = 0;
        for (; row + 2 <= state_dim; row += 2) {
            const size_t row0_off = (size_t)row * (size_t)state_dim;
            const size_t row1_off = (size_t)(row + 1) * (size_t)state_dim;
            const __m256 k0_v = _mm256_set1_ps(k_hat[row]);
            const __m256 k1_v = _mm256_set1_ps(k_hat[row + 1]);
            const __m256 q0_v = _mm256_set1_ps(q_hat[row]);
            const __m256 q1_v = _mm256_set1_ps(q_hat[row + 1]);

            col = 0;
            for (; col + 8 <= state_dim; col += 8) {
                __m256 cur0_v = _mm256_loadu_ps(state_cur + row0_off + (size_t)col);
                __m256 cur1_v = _mm256_loadu_ps(state_cur + row1_off + (size_t)col);
                __m256 delta_v = _mm256_loadu_ps(delta + col);
                __m256 out_v = _mm256_loadu_ps(out_head + col);
                __m256 upd0_v = ck_deltanet_fmadd256(k0_v, delta_v, cur0_v);
                __m256 upd1_v = ck_deltanet_fmadd256(k1_v, delta_v, cur1_v);
                out_v = ck_deltanet_fmadd256(upd0_v, q0_v, out_v);
                out_v = ck_deltanet_fmadd256(upd1_v, q1_v, out_v);
                _mm256_storeu_ps(state_cur + row0_off + (size_t)col, upd0_v);
                _mm256_storeu_ps(state_cur + row1_off + (size_t)col, upd1_v);
                _mm256_storeu_ps(out_head + col, out_v);
            }
            for (; col < state_dim; ++col) {
                const float upd0 = state_cur[row0_off + (size_t)col] + k_hat[row] * delta[col];
                const float upd1 = state_cur[row1_off + (size_t)col] + k_hat[row + 1] * delta[col];
                state_cur[row0_off + (size_t)col] = upd0;
                state_cur[row1_off + (size_t)col] = upd1;
                out_head[col] += upd0 * q_hat[row] + upd1 * q_hat[row + 1];
            }

DeltaNet AVX2 kernel processing two state-matrix rows per iteration for better reuse and FMA density.

The key pattern is the pair of broadcasts k0_v and k1_v. Each is created once with _mm256_set1_ps and then reused across 8 output columns. That halves the outer-loop overhead relative to a one-row scheme while also raising the FMA density of the inner loop.

Row-pair unrolling is not “micro-optimization.” It is a deliberate decision about where to spend broadcasts, loads, and loop-control instructions.

DeltaNet is especially instructive for vendors because it is not a toy BLAS loop. The kernel is doing decay, recall, delta formation, rank-1 writeback, and readout over a d × d state matrix. Yet the vectorization pattern still reduces to the same ISA questions: broadcast reuse, FMA density, reduction shape, and scalar tails.

FMA: The Single Most Important Instruction

If one instruction deserves credit for making CPU AI kernels feel modern, it is FMA. ML arithmetic is mostly accumulations of the form a*b+c. Without FMA, the software pays for a multiply and then an add, with an intermediate rounding point. With FMA, the hardware does the combined operation in one fused step.

\operatorname{fma}(a,b,c)=a\cdot b+c

\operatorname{fnmadd}(a,b,c)=c-a\cdot b

RMSNorm accumulation with AVX-512 FMA

sum_sq_vec = _mm512_fmadd_ps(xv, xv, sum_sq_vec);

AdamW weight update with AVX-512 fused negate-multiply-add

w = _mm512_fnmadd_ps(v_lr, update, w);

DeltaNet AVX2 abstraction — FMA when available, mul+add otherwise

#if defined(__AVX2__)
static inline __m256 ck_deltanet_fmadd256(__m256 a, __m256 b, __m256 c)
{
#if defined(__FMA__)
    return _mm256_fmadd_ps(a, b, c);
#else
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
#endif
}

Kernel motif	No FMA	With FMA
RMSNorm sum of squares	`mul` then `add`	`fmadd(x, x, acc)`
AdamW momentum / variance	`mul + mul + add`	`mul + fmadd`
AdamW final step	`mul + sub`	`fnmadd`
DeltaNet state update	`mul + add`	`fmadd` via helper

AVX2 matters partly because of integer instructions, but for FP32 kernels the practical step change is often the presence of FMA. Width alone increases parallelism. FMA reduces the arithmetic graph itself.

1 rounding FMA produces one final rounding instead of rounding the multiply and the add separately. That is both a performance gain and a precision change.

Quantized SIMD: Dequant + Compute in Registers

The most intimidating SIMD in the repository is the quantized GEMM path. src/kernels/gemm_kernels_q6k_q8k.c is currently 1229 lines long and contains 470 intrinsic calls. That alone tells you what quantized CPU inference really is: not just smaller weights, but an ISA-level unpack, rescale, dot, and reduction pipeline.

Q6_K × Q8_K file header and packed-format comment

/**
 * @file gemm_kernels_q6k_q8k.c
 * @brief Q6_K (weights) x Q8_K (activations) kernels for inference
 *
 * CK-ENGINE KERNEL RULES:
 * =======================
 * 1. NO malloc/free - memory via bump allocator, pointers passed in
 * 2. NO OpenMP - parallelization at orchestrator/codegen layer
 * 3. API must define: inputs, outputs, workspace, and memory layouts
 * 4. Pure computation - deterministic, no side effects
 *
 * After changes: make test && make llamacpp-parity-full
 *
 * Implements decode-style matvec/matmul where weights are Q6_K and the
 * activations are quantized on-the-fly to Q8_K. This is inference-only;
 * no backward pass is provided here.
 *
 * Q6_K Format (256 weights per block):
 *   - d: FP16 super-block scale
 *   - ql: 128 bytes (low 4 bits of each weight)
 *   - qh: 64 bytes (high 2 bits of each weight)
 *   - scales: 16 int8 sub-block scales
 *
 * Q8_K Format (256 weights per block):
 *   - d: FP32 scale
 *   - qs: 256 int8 values
 *   - bsums: 16 int16 block sums
 */

Q6_K × Q8_K AVX2 inner loop — unpack, offset correction, scale, accumulate

        for (int j = 0; j < QK_K / 128; ++j) {
            const __m128i scale_0 = _mm_shuffle_epi8(scales, get_scale_shuffle_avx2(is + 0));
            const __m128i scale_1 = _mm_shuffle_epi8(scales, get_scale_shuffle_avx2(is + 1));
            const __m128i scale_2 = _mm_shuffle_epi8(scales, get_scale_shuffle_avx2(is + 2));
            const __m128i scale_3 = _mm_shuffle_epi8(scales, get_scale_shuffle_avx2(is + 3));
            is += 4;

            const __m256i q4bits1 = _mm256_loadu_si256((const __m256i *)q4);
            q4 += 32;
            const __m256i q4bits2 = _mm256_loadu_si256((const __m256i *)q4);
            q4 += 32;
            const __m256i q4bitsH = _mm256_loadu_si256((const __m256i *)qh);
            qh += 32;

            const __m256i q4h_0 = _mm256_slli_epi16(_mm256_and_si256(q4bitsH, m2), 4);
            const __m256i q4h_1 = _mm256_slli_epi16(_mm256_and_si256(_mm256_srli_epi16(q4bitsH, 2), m2), 4);
            const __m256i q4h_2 = _mm256_slli_epi16(_mm256_and_si256(_mm256_srli_epi16(q4bitsH, 4), m2), 4);
            const __m256i q4h_3 = _mm256_slli_epi16(_mm256_and_si256(_mm256_srli_epi16(q4bitsH, 6), m2), 4);

            const __m256i q4_0 = _mm256_or_si256(_mm256_and_si256(q4bits1, m4), q4h_0);
            const __m256i q4_1 = _mm256_or_si256(_mm256_and_si256(q4bits2, m4), q4h_1);
            const __m256i q4_2 = _mm256_or_si256(_mm256_and_si256(_mm256_srli_epi16(q4bits1, 4), m4), q4h_2);
            const __m256i q4_3 = _mm256_or_si256(_mm256_and_si256(_mm256_srli_epi16(q4bits2, 4), m4), q4h_3);

            const __m256i q8_0 = _mm256_loadu_si256((const __m256i *)q8);
            q8 += 32;
            const __m256i q8_1 = _mm256_loadu_si256((const __m256i *)q8);
            q8 += 32;
            const __m256i q8_2 = _mm256_loadu_si256((const __m256i *)q8);
            q8 += 32;
            const __m256i q8_3 = _mm256_loadu_si256((const __m256i *)q8);
            q8 += 32;

            /* Compute -32 * q8 contribution */
            __m256i q8s_0 = _mm256_maddubs_epi16(m32s, q8_0);
            __m256i q8s_1 = _mm256_maddubs_epi16(m32s, q8_1);
            __m256i q8s_2 = _mm256_maddubs_epi16(m32s, q8_2);
            __m256i q8s_3 = _mm256_maddubs_epi16(m32s, q8_3);

            /* Multiply q4 * q8 */
            __m256i p16_0 = _mm256_maddubs_epi16(q4_0, q8_0);
            __m256i p16_1 = _mm256_maddubs_epi16(q4_1, q8_1);
            __m256i p16_2 = _mm256_maddubs_epi16(q4_2, q8_2);
            __m256i p16_3 = _mm256_maddubs_epi16(q4_3, q8_3);

            /* Subtract offset: (q4 - 32) * q8 = q4*q8 - 32*q8 */
            p16_0 = _mm256_sub_epi16(p16_0, q8s_0);
            p16_1 = _mm256_sub_epi16(p16_1, q8s_1);
            p16_2 = _mm256_sub_epi16(p16_2, q8s_2);
            p16_3 = _mm256_sub_epi16(p16_3, q8s_3);

            /* Apply scales */
            p16_0 = _mm256_madd_epi16(_mm256_cvtepi8_epi16(scale_0), p16_0);
            p16_1 = _mm256_madd_epi16(_mm256_cvtepi8_epi16(scale_1), p16_1);
            p16_2 = _mm256_madd_epi16(_mm256_cvtepi8_epi16(scale_2), p16_2);
            p16_3 = _mm256_madd_epi16(_mm256_cvtepi8_epi16(scale_3), p16_3);

            sumi = _mm256_add_epi32(sumi, _mm256_add_epi32(p16_0, p16_1));
            sumi = _mm256_add_epi32(sumi, _mm256_add_epi32(p16_2, p16_3));
        }

        acc = _mm256_fmadd_ps(_mm256_broadcast_ss(&d), _mm256_cvtepi32_ps(sumi), acc);
    }

    /* Horizontal sum */
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 sum128 = _mm_add_ps(hi, lo);
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    return _mm_cvtss_f32(sum128);

Q4_K × Q8_K VNNI byte-dot microkernel

static inline int32_t dot_q4_k_q8_k_32_vnni(const uint8_t *q4_packed_32,
                                            const int8_t *q8_32,
                                            int high_nibble) {
    const __m256i packed = _mm256_loadu_si256((const __m256i *)q4_packed_32);
    const __m256i mask4 = _mm256_set1_epi8(0x0F);
    const __m256i q4_bytes = high_nibble
        ? _mm256_and_si256(_mm256_srli_epi16(packed, 4), mask4)
        : _mm256_and_si256(packed, mask4);
    const __m256i q8_bytes = _mm256_loadu_si256((const __m256i *)q8_32);
    __m256i acc = _mm256_setzero_si256();
    acc = _mm256_dpbusd_epi32(acc, q4_bytes, q8_bytes);
    return hsum256_epi32(acc);

Quantized SIMD flow showing packed weights unpacked and multiplied in registers before horizontal reduction.

There are three layers of work here. First, unpack the bit-packed weight representation. Second, correct the zero-point or implicit offset in registers. Third, immediately multiply against quantized activations and accumulate before anything touches a scalar loop.

The impressive part is not just the dot product. It is that dequantization never escapes into a slow scalar staging buffer.

The AVX2 path is especially clear: _mm256_maddubs_epi16 handles unsigned-times-signed byte products, the -32 correction is applied in SIMD, _mm256_cvtepi8_epi16 widens scale bytes, and then _mm256_fmadd_ps converts the integer partials into floating accumulation. The later VNNI path goes even further with _mm256_dpbusd_epi32, which is effectively a dedicated quantized dot-product instruction.

Quantized GEMV public dispatch — reference by default, SIMD for controlled benchmarking

void gemv_q6_k_q8_k(float *y,
                     const void *W,
                     const void *x_q8,
                     int M, int K)
{
    const char *simd_env = getenv("CK_DEBUG_Q6K_Q8K_SIMD");
    if (simd_env && simd_env[0] && simd_env[0] != '0') {
#if defined(__AVX512VBMI__)
        gemv_q6_k_q8_k_avx512_vbmi(y, W, x_q8, M, K);
        return;
#elif defined(__AVX512F__)
        gemv_q6_k_q8_k_avx512(y, W, x_q8, M, K);
        return;
#elif defined(__AVX2__)
        gemv_q6_k_q8_k_avx2(y, W, x_q8, M, K);
        return;
#elif defined(__AVX__)
        gemv_q6_k_q8_k_avx(y, W, x_q8, M, K);
        return;
#elif defined(__SSE4_1__)
        gemv_q6_k_q8_k_sse(y, W, x_q8, M, K);
        return;
#endif
    }
    gemv_q6_k_q8_k_ref(y, W, x_q8, M, K);
}

Note the dispatch policy: the public entry point stays on the reference reduction order unless a debug environment flag explicitly requests the SIMD variant. That is not hesitation. It is production discipline around parity and long-context drift.

Memory Bandwidth: The Real Bottleneck

Once the arithmetic is vectorized well, bandwidth often becomes the dominant limit. In batch-1 decode, GEMV-like access patterns pull one weight row per output element. For a 4096-dimensional FP32 row that is roughly 16 KB of weights per output channel before you count activation traffic. SIMD makes the math cheap enough that moving those bytes becomes the next problem.

\text{row bytes}=4096 \times 4 = 16384\text{ bytes}\approx16\text{ KB}

Representation	Approx bytes for 4096 weights	What it buys
FP32	16 KB	Simple math, high bandwidth pressure.
Q6_K	≈ 6 KB plus scales/metadata	Substantial traffic reduction, more dequant work.
Q4_K	≈ 4 KB plus metadata	Even less traffic, even more unpack logic.

Q6_K file comment — why the packed layout exists at all

/**
 * @file gemm_kernels_q6k_q8k.c
 * @brief Q6_K (weights) x Q8_K (activations) kernels for inference
 *
 * CK-ENGINE KERNEL RULES:
 * =======================
 * 1. NO malloc/free - memory via bump allocator, pointers passed in
 * 2. NO OpenMP - parallelization at orchestrator/codegen layer
 * 3. API must define: inputs, outputs, workspace, and memory layouts
 * 4. Pure computation - deterministic, no side effects
 *
 * After changes: make test && make llamacpp-parity-full
 *
 * Implements decode-style matvec/matmul where weights are Q6_K and the
 * activations are quantized on-the-fly to Q8_K. This is inference-only;
 * no backward pass is provided here.
 *
 * Q6_K Format (256 weights per block):
 *   - d: FP16 super-block scale
 *   - ql: 128 bytes (low 4 bits of each weight)
 *   - qh: 64 bytes (high 2 bits of each weight)
 *   - scales: 16 int8 sub-block scales
 *
 * Q8_K Format (256 weights per block):
 *   - d: FP32 scale
 *   - qs: 256 int8 values
 *   - bsums: 16 int16 block sums
 */

Quantization is therefore a memory-system optimization as much as a model-size optimization. You accept more work per loaded byte so the CPU can do fewer cache-line fills for the same token.

bandwidth-bound The better the SIMD gets, the easier it becomes to discover that the next bottleneck is weight movement rather than arithmetic.

This is also where the kernel rules from the opening matter again. No allocator churn in the hot path. No ad hoc threading inside the loop nest. Stable layouts and predictable strides are what give the hardware prefetchers and caches a chance to help.

Parity and Precision: Why Scalar Still Matters

Vectorization changes floating-point behavior even when the source formula is unchanged. FMA has different rounding behavior from a split multiply plus add. Horizontal reduction changes accumulation order. Quantized VNNI kernels can move borderline logits when accumulation order changes enough. C-Kernel-Engine addresses that directly rather than pretending the issue does not exist.

Strict-parity hooks from RMSNorm and DeltaNet

    const char *exact_env = getenv("CK_RMSNORM_EXACT");
    if (ck_strict_parity_enabled() || (exact_env && atoi(exact_env) != 0)) {
        rmsnorm_forward_strict_scalar(input, gamma, output, rstd_cache, T, D, aligned, eps);
        return;
    }

static int ck_deltanet_force_ref(void)
{
    const char *env = getenv("CK_DELTANET_FORCE_REF");
    return env && atoi(env) != 0;
}

const char *gated_deltanet_impl_name(void)
{
    if (ck_strict_parity_enabled() || ck_deltanet_force_ref()) {
        return "REF";
    }

AdamW strict fp64 path again — parity is an implementation tier, not a test harness afterthought

    if (ck_strict_parity_enabled()) {
        const double beta1_d = (double)beta1;
        const double beta2_d = (double)beta2;
        const double one_minus_beta1_d = 1.0 - beta1_d;
        const double one_minus_beta2_d = 1.0 - beta2_d;
        const double lr_d = (double)lr;
        const double eps_d = (double)eps;
        const double wd_d = (double)weight_decay;

        const double bc1 = 1.0 - pow(beta1_d, (double)step);
        const double bc2 = 1.0 - pow(beta2_d, (double)step);
        const double step_size = lr_d / bc1;
        const double bc2_sqrt = sqrt(bc2);
        const double wd_scale = 1.0 - lr_d * wd_d;

        for (size_t i = 0; i < numel; ++i) {
            double g = (double)grad[i];
            double w = (double)weight[i];
            double m_i = (double)m[i];
            double v_i = (double)v[i];

            m_i = beta1_d * m_i + one_minus_beta1_d * g;
            v_i = beta2_d * v_i + one_minus_beta2_d * g * g;

            // Match PyTorch AdamW op order: decoupled weight decay first.
            w *= wd_scale;

            double denom = sqrt(v_i) / bc2_sqrt + eps_d;
            w -= step_size * (m_i / denom);

            m[i] = (float)m_i;
            v[i] = (float)v_i;
            weight[i] = (float)w;
        }
        return;

VNNI caution comment — correctness first

    /* Correctness first: the fast VNNI path changes the float accumulation
     * order enough to move borderline Qwen3.5 logits. Keep production on the
     * llama-style scalar accumulation unless explicitly benchmarking the fast
     * path with CK_ENABLE_Q4K_Q8K_VNNI_FAST=1.
     */
    gemv_q4_k_q8_k_ref(y, W, x_q8, M, K);

The long-horizon warning in the AdamW header is the kind of note only a real production kernel engineer writes. It does not say “SIMD is faster, trust us.” It says that all-FP32 SIMD can drift after enough optimizer steps and that the fp64 scalar path should be used to validate parity.

Auditable software does not hide floating-point trade-offs. It annotates them, exposes switches for them, and tests against them.

For a CPU vendor, this is a very strong competence signal. It says the author understands that ISA-level optimization is never only about throughput. It is equally about reproducibility, explainability, and controlled deviation from a reference implementation.

The Kernel File Strategy

The repository organization reflects the same philosophy as the dispatch logic: name the hardware path and the tensor contract explicitly. In the current tree there are 83 kernel .c files, 49 with explicit SIMD intrinsics, 27 in the quantization family, 17 centered on reduced precision, and 12 whose filenames already encode ISA specializations such as _sse, _avx, _avx2, _avx512, _vnni, and _amx.

Organization axis	Examples from CKE	Why it matters
Operation	`rmsnorm`, `attention`, `rope`, `swiglu`, `deltanet`, `optimizer`	One kernel family per mathematical contract.
Quantization	`q4_0`, `q4k`, `q5_k`, `q6k`, `q8_0`	Data layout is visible in the filename.
ISA	`_sse`, `_avx`, `_avx2`, `_avx512`, `_vnni`, `_amx`	Hardware-specific code is isolated and reviewable.
Precision	`bf16`, `f16`, FP32, int8	Numerical policy is separated from algorithmic policy.

This layout is not just tidy. It is what lets you benchmark and review kernels like a hardware team. When a file is named gemm_kernels_q4k_q8k_vnni.c, you already know the data format, the operation, and the intended instruction family before opening it.

49 files A large fraction of the repository is already organized around explicit ISA-aware execution, not generic scalar fallback code.

What ARM SVE and Future ISAs Mean

The future-looking lesson is not “rewrite everything in a different language.” It is that the existing C-Kernel-Engine architecture already matches how new CPUs want to be targeted. Pure C kernels, explicit dispatch, stable tensor contracts, and no framework dependencies are exactly what scalable-vector and matrix-extension hardware needs.

Q6_K kernel includes both x86 and ARM SIMD headers

/* Include SIMD headers based on available extensions */
#if defined(__AVX512F__) || defined(__AVX2__) || defined(__AVX__) || defined(__SSE4_1__) || defined(__SSSE3__)
#include <immintrin.h>
#endif
#if defined(__ARM_NEON) || defined(__aarch64__)
#include <arm_neon.h>
#endif

ARM feature hooks in ck_features.h

/* ARM */
#if defined(__aarch64__)
    #if defined(__ARM_FEATURE_SVE2)
        #define CK_HAS_SVE2 1
    #endif
    #if defined(__ARM_FEATURE_NEON)
        #define CK_HAS_NEON 1
    #endif
#endif

/* Best available vector width */
#if defined(CK_HAS_AMX)
    #define CK_VECTOR_WIDTH 512
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX512)
    #define CK_VECTOR_WIDTH 512
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX2_FMA)
    #define CK_VECTOR_WIDTH 256
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_AVX)
    #define CK_VECTOR_WIDTH 256
    #define CK_HAS_BEST_VECTOR 1
#elif defined(CK_HAS_NEON)
    #define CK_VECTOR_WIDTH 128
    #define CK_HAS_BEST_VECTOR 1
#else
    #define CK_VECTOR_WIDTH 32  /* Scalar fallback */
    #define CK_HAS_BEST_VECTOR 0
#endif

Future ISA roadmap connecting x86 SIMD to ARM SVE2, AMX, and future matrix extensions for AI kernels.

SVE changes one especially important detail: vector length becomes hardware-chosen rather than baked into the binary as 128, 256, or 512 bits. That pushes software toward vector-length-agnostic loops and predication instead of scalar tails. But the higher-level kernel architecture barely changes.

The dispatch strategy survives the ISA change. Only the vector abstraction changes.

That is why CKE is a plausible fit for future ARM server CPUs as well as new x86 extensions such as AMX. The codebase already thinks in terms of “what exact hardware primitive does this kernel tier expose?” which is the right question whether the answer is FMA, VNNI, SVE2 predication, or tile ops.

Why this matters to ARM, Intel, and AMD teams

A kernel repository is most convincing when it already separates algorithm, data layout, dispatch policy, and ISA specialization. That is what makes a port to SVE2, SME, or AMX a focused engineering project instead of a rewrite.

In other words: the optimization surface is already the ISA. New hardware simply adds a new tier to the same disciplined ladder.

Conclusion: The ISA Is the Optimization Surface

“CPUs are slow for AI” is usually a statement about software quality, not a law of physics. The kernels in this post show the real ladder: scalar reference for truth, SSE2 for the first vector baseline, AVX for width, AVX2+FMA for collapsed arithmetic, AVX-512 for more width plus better reductions, and VNNI or AMX style extensions for ever more specialized inner loops.

C-Kernel-Engine makes that ladder explicit. The same RMSNorm, AdamW, DeltaNet, and quantized GEMV algorithms are visible at multiple ISA tiers, with parity hooks and environment overrides making every optimization testable. That is exactly the kind of implementation discipline CPU vendors should care about.

The closing claim is therefore simple. CPU-first AI inference is not a speculative future. It is already here wherever the kernel layer is written with enough respect for the ISA.

Good AI software does not fight the instruction set. It treats the instruction set as the product surface.

Series continuity

This post connected the RMSNorm story, the optimizer mechanics, and the DeltaNet recurrent kernel into one hardware-centric view.

If you want the visual companion version, the short-form walkthrough is on youtube.com/@antshivrobotics.

SIMD Deep Dive: How AI Kernels Use SSE, AVX, AVX-512, and VNNI

What this post covers

Why SIMD Matters for AI on CPU

What Is SIMD? Single Instruction, Multiple Data

The x86 SIMD Evolution

CKE's Dispatch Pattern

Why this matters to an ISA reviewer

RMSNorm Forward: Scalar → AVX → AVX-512

AdamW Optimizer: SSE2 / AVX / AVX-512

The Horizontal Reduction Problem

DeltaNet State Sweep: AVX2 Row-Pair Unroll

FMA: The Single Most Important Instruction

Quantized SIMD: Dequant + Compute in Registers

Memory Bandwidth: The Real Bottleneck

Parity and Precision: Why Scalar Still Matters

The Kernel File Strategy

What ARM SVE and Future ISAs Mean

Why this matters to ARM, Intel, and AMD teams

Conclusion: The ISA Is the Optimization Surface

Series continuity

Need an intelligent system to work on real hardware?

Embedded systems · Robotics · Constrained AI · CPU and HPC · Accelerators · Distributed systems

ShivasNotes

Read

Support

SIMD Deep Dive: How AI Kernels Use SSE, AVX, AVX-512, and VNNI

What this post covers

Why SIMD Matters for AI on CPU

What Is SIMD? Single Instruction, Multiple Data

The x86 SIMD Evolution

CKE's Dispatch Pattern

Why this matters to an ISA reviewer

RMSNorm Forward: Scalar → AVX → AVX-512

AdamW Optimizer: SSE2 / AVX / AVX-512

The Horizontal Reduction Problem

DeltaNet State Sweep: AVX2 Row-Pair Unroll

FMA: The Single Most Important Instruction

Quantized SIMD: Dequant + Compute in Registers

Memory Bandwidth: The Real Bottleneck

Parity and Precision: Why Scalar Still Matters

The Kernel File Strategy

What ARM SVE and Future ISAs Mean

Why this matters to ARM, Intel, and AMD teams

Conclusion: The ISA Is the Optimization Surface

Series continuity

Subscribe

Subscribe to emails from Anthony

Need an intelligent system to work on real hardware?

Embedded systems · Robotics · Constrained AI · CPU and HPC · Accelerators · Distributed systems

ShivasNotes

Read

Support