Quantization Deep Dive: How CPU Kernels Compress Weights and Preserve Accuracy

Quantization deep dive · ML fundamentals

This ShivasNotes deep dive is written for CPU silicon teams evaluating whether C-Kernel-Engine really owns its quantized inference path. The answer is yes: quantization in C-Kernel-Engine is handwritten kernel code, documented at the quantization deep dive and the format reference, with the walkthrough companion on youtube.com/@antshivrobotics@shivasnotes.

The thesis of this post is simple: quantization is the CPU story. If inference is memory-bound, then the decisive engineering question is not just “how many FLOPs can the core issue?” but “how aggressively can the runtime compress weights, preserve accuracy, and keep dequantized values inside registers instead of memory?” For a silicon team, this is the tell: CKE does not wave at quantization as a checkbox. It names the block format, the byte packing, the dispatch rule, the ISA tier, and even the parity bug that only showed up under real model bring-up.

What this post covers

First, the post frames quantization as the practical way CPUs turn large models from impossible to deployable. Then it walks the actual CKE formats and the mixed-quant runtime path that matters in production.

The second half moves from formats to engineering reality: dequantization in registers, ISA-specific kernels from SSE through AVX-512 VNNI and AMX, the March 2026 Q8_K SSE parity fix, cache-line geometry, allocator integration, and what these choices signal to ARM, Intel, and AMD.

Introduction — Why Quantization Is the CPU Story

AI inference on CPU is usually memory-bound long before it is arithmetic-bound. At batch size 1, decode looks like repeated matrix-vector multiplies with low arithmetic intensity. That means the system wins when it moves fewer bytes, not when it merely advertises higher peak FLOPs.

That is why quantization is central rather than optional. A 7B model in FP32 weighs roughly 28 GB. Put the same model into a Q4_K-style footprint and the memory story drops to roughly 3.5 GB. That is the difference between “does not fit in the practical deployment box” and “runs in one coherent CPU memory space.”

CKE’s quantization story is therefore not “we call a vendor library.” It is a kernel surface: dedicated per-format GEMV and GEMM files, dequantization helpers, activation quantizers, fused quantized attention pieces, and ISA-tiered implementations that exist because the runtime owns the layout contract end to end.

That ownership matters for silicon review. When a runtime fully owns quantization, the evaluator can see exactly where SSE ends, where AVX2 begins, where VNNI changes the inner loop, where AMX becomes relevant, and where correctness is protected by strict parity fallbacks. 32 files The public framing for this post is roughly 32 quantization-focused kernel files and 11.6k+ lines of handwritten C across scalar, SSE, AVX, AVX2, AVX-512, VNNI, AMX, and ARM NEON paths.

Quantization turning large-model inference from a memory-capacity problem into a practical CPU deployment path.

Why quantization matters — memory footprint, not just arithmetic throughput

text

7B parameters:
  FP32  -> 7e9 × 4 bytes  ≈ 28 GB
  INT8  -> 7e9 × 1 byte   ≈  7 GB
  INT4  -> 7e9 × 0.5 byte ≈  3.5 GB

Decode reality:
  arithmetic intensity is low
  memory traffic dominates
  fewer bytes moved = faster practical inference

Quantization kernel surface discussed in this post

text

Format kernels:
  q4_0, q4_1, q5_0, q5_1, q5_k, q6_k, q8_0, q4_k, q8_k

Support kernels:
  dequant_kernels.c
  quantize_row_q8_k_*.c
  rmsnorm_kernels_int4.c
  rmsnorm_kernels_int8.c

ISA-specialized paths:
  SSE / SSSE3
  AVX
  AVX2
  AVX-512
  AVX-512 VNNI
  AMX
  ARM NEON

Representative quantization file inventory discussed in this post

text

Per-format kernels:
  gemm_kernels_q4_0.c
  gemm_kernels_q4_1.c
  gemm_kernels_q5_0.c
  gemm_kernels_q5_1.c
  gemm_kernels_q5_k.c
  gemm_kernels_q6k.c
  gemm_kernels_q8_0.c
  gemm_kernels_q4k.c
  gemm_kernels_q4k_q8k.c
  gemm_kernels_q6k_q8k.c

Dispatch helpers and specialization files:
  gemm_kernels_q4k_sse.c
  gemm_kernels_q4k_avx.c
  gemm_kernels_q4k_q8k_avx2.c
  gemm_kernels_q4k_q8k_vnni.c
  gemm_kernels_amx.c
  quantize_row_q8_k_sse.c
  quantize_row_q8_k_avx.c
  quantize_row_q8_k_avx2.c
  quantize_row_q8_k_avx512.c

Representative current line counts from the CKE tree

text

352   gemm_kernels_q4_0.c
304   gemm_kernels_q4_1.c
1753  gemm_kernels_q5_0.c
369   gemm_kernels_q5_1.c
687   gemm_kernels_q5_k.c
736   gemm_kernels_q4k.c
400   gemm_kernels_q4k_q8k.c
1228  gemm_kernels_q6k_q8k.c
275   gemm_kernels_amx.c
100   quantize_row_q8_k_sse.c
542   dequant_kernels.c
140   rmsnorm_kernels_int4.c
123   rmsnorm_kernels_int8.c

The rest of this post treats quantization as a systems problem rather than a compression trick. The key questions are: how are bytes packed, how are scales represented, how does dispatch select the correct hot loop, and what kinds of bugs only show up when strict per-op parity is enforced.

Block Quantization Fundamentals

Single-scale quantization sounds simple: find the maximum absolute value, map the whole tensor into an integer range, and dequantize later by multiplying with one scale. For neural weights, that often fails because one large outlier forces the step size high enough that the small but important values collapse to zero.

Block quantization fixes that by grouping weights into small blocks and giving each block its own scale. In CKE’s simple formats that usually means 32 weights per block. In the K-quants it means 256-weight super-blocks with nested sub-scales.

This also explains the split between symmetric and asymmetric formats. Symmetric quantization assumes values cluster around zero and stores only a scale. Asymmetric quantization stores both a scale and an offset or minimum, which makes it more expensive in bytes but more accurate for distributions that are not centered.

The engineering trade is always the same: smaller groups improve local fidelity, but every group needs metadata. That metadata is why block layout is as important as bit width. Compression only helps if the scale overhead stays small and the kernel can decode it cheaply. A good quantization format does two things at once: it protects the model from catastrophic information loss and it stays friendly to the CPU’s load, unpack, multiply, and reduce instructions.

Concept	What it means	Why it matters in kernels
Single scale	One scale for a large tensor or row.	Cheap metadata, but small weights often quantize to zero.
Per-block scale	One scale for each 32-weight or 256-weight block.	Preserves local detail with manageable overhead.
Symmetric	Centered around zero, store scale only.	Simple decode, ideal for Q4_0 / Q5_0 / Q8_0 style paths.
Asymmetric	Store scale and offset / minimum.	Better fit for skewed ranges, but decode has an extra term.
K-quant	Super-block plus nested sub-scales.	Higher compression efficiency without per-32 FP16 overhead.

Symmetric quantization math

text

Quantize:
  scale = max(abs(weights)) / (2^(bits-1) - 1)
  q = round(weight / scale)

Dequantize:
  weight ≈ q × scale

Example intuition:
  INT4 symmetric range  ≈ [-8, 7]
  INT8 symmetric range  ≈ [-127, 127]

Asymmetric quantization math

text

Quantize:
  scale = (max - min) / (2^bits - 1)
  q = round((weight - min) / scale)

Dequantize:
  weight ≈ q × scale + min

CKE examples:
  Q4_1  -> q × d + m
  Q5_1  -> q × d + m

Why grouping wins

text

Single-scale failure:
  large outlier raises scale
  tiny weights round to 0

Per-block grouping:
  each block gets local scale
  tiny weights can still occupy multiple integer levels
  metadata cost stays bounded

Block quantization grouping showing why per-block scales protect small but important weights from collapsing to zero.

the SIMD deep dive framed the SIMD question as “same math, wider registers, fewer instructions.” Quantization adds a new layer: same math, fewer bytes, more careful unpacking. the ARM NEON kernel post showed the ARM version of this same story. Here the focus is the format contract itself.

The Format Landscape — 8 Formats in CKE

The public CKE quantization docs currently center eight formats: five simple 32-weight formats and three 256-weight K-quant formats. Those are the formats a silicon reviewer should understand first because they define the on-disk layout, the dequant math, and the runtime dispatch surface.

There is one subtle repository detail worth calling out. The kernel tree also contains gemm_kernels_q5_k.c. In other words, the compute surface already services an additional K-quant tier beyond the eight-format headline table. The point of this post, though, is the documented core set and the mixed-quant path that drives production inference.

Format	Bits / weight	Block size	Structure	Typical role
`Q4_0`	4.5	32	2-byte FP16 scale + 16 packed bytes	Simplest symmetric INT4 weight path.
`Q4_1`	5.0	32	FP16 scale + FP16 min + 16 packed bytes	Asymmetric INT4 weights.
`Q5_0`	5.5	32	FP16 scale + 4-byte high-bit field + 16 packed bytes	5-bit symmetric weights.
`Q5_1`	6.0	32	Scale + min + high-bit field + packed bytes	5-bit asymmetric weights.
`Q8_0`	8.5	32	FP16 scale + 32 signed bytes	High-fidelity simple quant format.
`Q4_K`	4.5	256	FP16 `d`, FP16 `dmin`, 12-byte packed scales, 128 packed bytes	Primary compression-first K-quant weight format.
`Q6_K`	6.5625	256	128-byte low bits + 64-byte high bits + 16 int8 scales + FP16	Higher-fidelity K-quant weights.
`Q8_K`	9.125	256	FP32 scale + 256 int8 values + 16 block sums	Activation-side bridge for mixed K-quant kernels.

block_q4_0 from include/ckernel_quant.h

#define QK4_0 32

typedef struct {
    ck_half d;             /* 2 bytes: scale (delta) */
    uint8_t qs[QK4_0 / 2]; /* 16 bytes: 32 x 4-bit weights (2 per byte) */
} block_q4_0;
/* Total: 18 bytes per 32 weights */

block_q8_0 from include/ckernel_quant.h

#define QK8_0 32

typedef struct {
    ck_half d;        /* 2 bytes: scale */
    int8_t qs[QK8_0]; /* 32 bytes: 32 x 8-bit signed weights */
} block_q8_0;
/* Total: 34 bytes per 32 weights */

block_q5_0 from include/ckernel_quant.h

#define QK5_0 32

typedef struct {
    ck_half d;             /* 2 bytes: scale (delta) */
    uint8_t qh[4];         /* 4 bytes: high 1-bit of each weight (32 bits total) */
    uint8_t qs[QK5_0 / 2]; /* 16 bytes: low 4-bits of 32 weights (2 per byte) */
} block_q5_0;
/* Total: 22 bytes per 32 weights */

block_q5_1 from include/ckernel_quant.h

#define QK5_1 32

typedef struct {
    ck_half d;             /* 2 bytes: scale (delta) */
    ck_half m;             /* 2 bytes: minimum */
    uint8_t qh[4];         /* 4 bytes: high 1-bit of each weight (32 bits total) */
    uint8_t qs[QK5_1 / 2]; /* 16 bytes: low 4-bits of 32 weights (2 per byte) */
} block_q5_1;
/* Total: 24 bytes per 32 weights */

block_q6_K from include/ckernel_quant.h

typedef struct {
    uint8_t ql[QK_K / 2];      /* 128 bytes: low 4 bits */
    uint8_t qh[QK_K / 4];      /* 64 bytes: high 2 bits */
    int8_t scales[QK_K / 16];  /* 16 bytes: 16 sub-block scales */
    ck_half d;                 /* 2 bytes: super-block scale */
} block_q6_K;
/* Total: 210 bytes per 256 weights */

block_q4_K from include/ckernel_quant.h

#define QK_K 256
#define K_SCALE_SIZE 12

typedef struct {
    ck_half d;                    /* 2 bytes: super-block scale */
    ck_half dmin;                 /* 2 bytes: super-block minimum */
    uint8_t scales[K_SCALE_SIZE]; /* 12 bytes: 8 sub-block scales + 8 sub-block mins (6-bit packed) */
    uint8_t qs[QK_K / 2];         /* 128 bytes: 256 x 4-bit weights */
} block_q4_K;
/* Total: 144 bytes per 256 weights */

block_q8_K from include/ckernel_quant.h

typedef struct {
    float d;                  /* 4 bytes: scale */
    int8_t qs[QK_K];          /* 256 bytes: 256 x 8-bit signed weights */
    int16_t bsums[QK_K / 16]; /* 32 bytes: block sums for optimization */
} block_q8_K;
/* Total: 292 bytes per 256 weights */

dequant_q5_0_block() from src/kernels/dequant_kernels.c

void dequant_q5_0_block(const block_q5_0 *block, float *output)
{
    const float d = GGML_FP16_TO_FP32(block->d);

    uint32_t qh;
    memcpy(&qh, block->qh, sizeof(qh));

    for (int j = 0; j < QK5_0 / 2; j++) {
        const uint8_t packed = block->qs[j];
        const int lo = (packed & 0x0F);
        const int hi = (packed >> 4);
        const int xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
        const int xh_1 = ((qh >> (j + 12))) & 0x10;
        const int q0 = (lo | xh_0) - 16;
        const int q1 = (hi | xh_1) - 16;

        output[j] = d * (float)q0;
        output[j + 16] = d * (float)q1;
    }
}

dequant_q6_k_block() from src/kernels/dequant_kernels.c

void dequant_q6_k_block(const block_q6_K *block, float *output)
{
    const float d = GGML_FP16_TO_FP32(block->d);
    const uint8_t *ql = block->ql;
    const uint8_t *qh = block->qh;
    const int8_t *sc = block->scales;
    float *y = output;

    for (int n = 0; n < QK_K; n += 128) {
        for (int l = 0; l < 32; ++l) {
            const int is = l / 16;
            const int8_t q1 = (int8_t)((ql[l + 0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
            const int8_t q2 = (int8_t)((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
            const int8_t q3 = (int8_t)((ql[l + 0] >> 4) | (((qh[l] >> 4) & 3) << 4)) - 32;
            const int8_t q4 = (int8_t)((ql[l + 32] >> 4) | (((qh[l] >> 6) & 3) << 4)) - 32;

            y[l + 0] = d * (float)sc[is + 0] * (float)q1;
            y[l + 32] = d * (float)sc[is + 2] * (float)q2;
            y[l + 64] = d * (float)sc[is + 4] * (float)q3;
            y[l + 96] = d * (float)sc[is + 6] * (float)q4;
        }
        y += 128;
        ql += 64;
        qh += 32;
        sc += 8;
    }
}

The most important format-level distinction is that Q8_K uses an FP32 scale, not FP16. That is not cosmetic. It tells you CKE treats the activation bridge with extra numerical care because mixed-quant boundaries amplify small contract mismatches. 292 B Q8_K spends more bytes than a simple 8-bit block because it is not just a storage format; it is a compute contract for the mixed K-quant path.

CKE quantization format landscape comparing simple 32-weight blocks with 256-weight K-quant super-blocks.

Q4_0 — The Simplest Format

Q4_0 is the cleanest starting point because it makes the packing logic obvious. One FP16 scale. Sixteen data bytes. Two 4-bit weights per byte. Each nibble becomes a signed integer in the range [-8, 7], multiplied by the block scale.

The key observation is that the physical byte order is not the logical element order a newcomer might assume. In CKE’s dequantizer, the lower nibble maps to one half of the block and the upper nibble maps to the other half. That choice keeps the layout compatible with GGML-style blocks and tells the kernel exactly how to unpack.

dequant_q4_0_block() from src/kernels/dequant_kernels.c

void dequant_q4_0_block(const block_q4_0 *block, float *output)
{
    const float d = GGML_FP16_TO_FP32(block->d);

    for (int i = 0; i < QK4_0 / 2; i++) {
        const uint8_t packed = block->qs[i];

        /* Lower nibble: elements 0..15 */
        const int8_t q0 = (packed & 0x0F) - 8;
        /* Upper nibble: elements 16..31 */
        const int8_t q1 = (packed >> 4) - 8;

        output[i] = d * (float)q0;
        output[i + QK4_0 / 2] = d * (float)q1;
    }
}

Q4_0 byte map

text

block_q4_0 = 18 bytes total

bytes 0..1   : d      (FP16 scale)
bytes 2..17  : qs[16] (32 weights packed as 4-bit nibbles)

Each qs[i] byte:
  low nibble  -> output[i]
  high nibble -> output[i + 16]

Cache-line geometry for Q4_0

text

Q4_0 block size = 18 bytes
64-byte cache line / 18 bytes ≈ 3.55 blocks
3.5 blocks × 32 weights ≈ 112 weights per line fetch

Interpretation:
  tiny metadata overhead
  dense weight packing
  strong bandwidth behavior for decode-style access

This is exactly why 4-bit quantization is so compelling on CPU. Even before any SIMD enters the frame, the memory subsystem gets a radically smaller working set. Then the kernel engineer’s task becomes unpacking those nibbles with minimal instruction overhead. 18 bytes That is the entire storage cost for 32 weights, including the scale metadata. The bandwidth story is already visible before a single FMA executes.

Q4_0 byte-level packing showing an FP16 scale followed by 16 bytes holding 32 signed 4-bit weights.

the SIMD deep dive focused on how wider registers reduce the instruction count for the same scalar math. Q4_0 adds a new lesson: quantized kernels begin with representation engineering. Before the vector multiply, the runtime has to know exactly which nibble means which weight.

Q4_K — K-Quant with Nested Scales

Q4_K is where quantization stops looking elementary and starts looking like serious runtime engineering. A single 256-weight super-block carries two FP16 values, a 12-byte packed metadata field, and 128 bytes of 4-bit data. The metadata does not merely hold one scale. It holds the ingredients for eight sub-block scales and eight sub-block minima.

That extra complexity exists for a reason. The K-quant idea is to preserve local fidelity without paying the metadata cost of one FP16 scale per 32-weight block. It compresses better than the simple formats while still giving the kernel enough local information to reconstruct usable floating-point values.

The common bug source is the sign on the minimum term. In Q4_K, the correct formula is weight = q × (d × sc) - (dmin × mn). The minus is not negotiable. Treating dmin like an additive bias yields numerically wrong outputs that can look superficially plausible until strict parity catches them.

unpack_q4_k_scales() from include/ckernel_quant.h

static inline void unpack_q4_k_scales(const uint8_t *scales,
                                       uint8_t *sc, uint8_t *m) {
    sc[0] = scales[0] & 0x3F;
    sc[1] = scales[1] & 0x3F;
    sc[2] = scales[2] & 0x3F;
    sc[3] = scales[3] & 0x3F;

    m[0] = scales[4] & 0x3F;
    m[1] = scales[5] & 0x3F;
    m[2] = scales[6] & 0x3F;
    m[3] = scales[7] & 0x3F;

    sc[4] = (scales[8]  & 0x0F) | ((scales[0] >> 6) << 4);
    sc[5] = (scales[9]  & 0x0F) | ((scales[1] >> 6) << 4);
    sc[6] = (scales[10] & 0x0F) | ((scales[2] >> 6) << 4);
    sc[7] = (scales[11] & 0x0F) | ((scales[3] >> 6) << 4);

    m[4] = (scales[8]  >> 4) | ((scales[4] >> 6) << 4);
    m[5] = (scales[9]  >> 4) | ((scales[5] >> 6) << 4);
    m[6] = (scales[10] >> 4) | ((scales[6] >> 6) << 4);
    m[7] = (scales[11] >> 4) | ((scales[7] >> 6) << 4);
}

dequant_q4_k_block() from src/kernels/dequant_kernels.c

void dequant_q4_k_block(const block_q4_K *block, float *output)
{
    const float d = GGML_FP16_TO_FP32(block->d);
    const float dmin = GGML_FP16_TO_FP32(block->dmin);

    uint8_t sc[8], m[8];
    unpack_q4_k_scales(block->scales, sc, m);

    for (int iter = 0; iter < 4; iter++) {
        const float d1 = d * (float)sc[2 * iter];
        const float m1 = dmin * (float)m[2 * iter];
        const float d2 = d * (float)sc[2 * iter + 1];
        const float m2 = dmin * (float)m[2 * iter + 1];

        const uint8_t *qs = &block->qs[iter * 32];
        float *out = &output[iter * 64];

        for (int l = 0; l < 32; l++) {
            const int q = (qs[l] & 0x0F);
            out[l] = d1 * (float)q - m1;
        }

        for (int l = 0; l < 32; l++) {
            const int q = (qs[l] >> 4);
            out[32 + l] = d2 * (float)q - m2;
        }
    }
}

Q4_K super-block mental model

text

256-weight super-block
  d      : FP16 super-scale
  dmin   : FP16 super-min scale
  scales : packed 6-bit sub-scale + sub-min fields
  qs     : 128 bytes of packed 4-bit values

Decode rule per sub-block:
  weight = q × (d × sc) - (dmin × mn)

This is the point where quantization becomes a real kernel-design problem. The decode is no longer just “extract nibble, subtract center, multiply by scale.” It is metadata unpacking, sub-block bookkeeping, and then an expression whose sign matters enough to destroy parity if implemented loosely. If you remember only one formula from this section, remember the minus sign. Q4_K subtracts dmin × mn. Adding it is a classic wrong-output trap.

Q4_K nested-scale super-block showing FP16 super-scales, packed 6-bit sub-scales and minima, and 128 bytes of packed weights.

K-quants exist because the simple formats leave quality on the table at the same nominal bit width. The engineering win is not just size reduction. It is better compression without surrendering the local dynamic range that modern LLM layers need.

Q8_K — The Activation-Side Bridge

The most common mistake in reading a quantized runtime is assuming every format is just “another way to store weights.” Q8_K is more important than that. In CKE’s current mixed-quant path, it is the activation-side bridge between FP32 hidden state and K-quant weight kernels.

That explains two unusual decisions. First, the scale is FP32, not FP16. Second, the struct carries bsums: precomputed sums of each 16-value chunk. Those block sums are not decorative metadata; they exist so kernels such as the VNNI path can cheaply handle offset terms without rescanning the activation bytes.

block_q8_K and its intent from include/ckernel_quant.h

/* Q8_K: K-Quant 8-bit (used for activations in some ops)
 * - 256 weights per super-block
 * - 1 FP32 scale per block (not FP16 like others!)
 */

typedef struct {
    float d;                  /* 4 bytes: scale */
    int8_t qs[QK_K];          /* 256 bytes: 256 x 8-bit signed weights */
    int16_t bsums[QK_K / 16]; /* 32 bytes: block sums for optimization */
} block_q8_K;

quantize_row_q8_k_sse() block-sum writeback from src/kernels/quantize_row_q8_k_sse.c

for (int j = 0; j < QK_K; j += 16) {
    ...
    _mm_storeu_si128((__m128i *)(y[i].qs + j), q0123);

    int sum = 0;
    for (int ii = 0; ii < 16; ++ii) {
        sum += y[i].qs[j + ii];
    }
    y[i].bsums[j / 16] = (int16_t)sum;
}

y[i].d = 1.0f / iscale;

Why Q8_K exists in the v7 path

text

FP32 hidden state
  -> quantize_row_q8_k
  -> Q8_K activation block
  -> gemv_q4_k_q8_k or gemv_q6_k_q8_k

Q8_K is not merely "8-bit weights".
It is the activation contract that mixed K-quant kernels expect.

This is the format that makes the modern runtime path possible. The weights remain compressed in Q4_K or Q6_K, while each token’s FP32 hidden state is quantized on the fly into a layout whose scale precision and block-sum side channel are tailored to the downstream kernel. FP32 scale That extra precision is one reason the mixed-quant boundary can stay numerically well-behaved—until a parity bug breaks the contract, as the Q8_K parity section shows.

If the ARM NEON kernel post showed that ARM NEON already participates in quantized inference, Q8_K explains why the mixed path is portable across ISA tiers. The activation contract is fixed; only the way each architecture consumes it changes.

The Mixed-Quant Runtime Path

The production runtime story is not “everything is statically quantized forever.” Instead, CKE holds the hidden state in FP32, quantizes that activation row into Q8_K, and then dispatches into the matching K-quant kernel for the weights. This is the modern mixed-quant path that matters for v7-style inference.

The central systems idea is separation of concerns. Tensors know their DType. Dispatch resolves the correct kernel before the hot loop begins. The kernel then runs against homogeneous, already-understood layouts instead of wasting cycles on type checks inside the inner loop.

DType enum shown in the CKE quantization docs

typedef enum {
    DTYPE_FP32,
    DTYPE_FP16,
    DTYPE_BF16,
    DTYPE_Q4_0,
    DTYPE_Q4_1,
    DTYPE_Q5_0,
    DTYPE_Q5_1,
    DTYPE_Q4_K,
    DTYPE_Q6_K,
    DTYPE_Q8_0,
    DTYPE_Q8_K
} DType;

select_linear_kernel() from the CKE quantization docs

const char *select_linear_kernel(DType weight, DType activation, bool prefill) {
    if (!prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemv_q4_k_q8_k";
    if (!prefill && weight == DTYPE_Q6_K && activation == DTYPE_Q8_K) return "gemv_q6_k_q8_k";
    if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemm_nt_q4_k_q8_k";
    if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_FP32) return "gemm_nt_q4_k";
    if (!prefill && weight == DTYPE_Q8_0 && activation == DTYPE_FP32) return "gemv_q8_0";
    return "fallback_or_error";
}

Operation lowering shown in the docs

void lower_mixed_quant_linear(...) {
    quantize_row_q8_k(hidden_fp32, hidden_q8k, K);
    gemv_q4_k_q8_k(out_fp32, weight_q4k, hidden_q8k, K);
}

The important phrase here is type checking at tensor level, never in the hot path. This is the same software discipline good compiler backends follow: decide the representation early, then let the inner loop focus exclusively on arithmetic and memory movement. For CPU vendors, dispatch clarity matters almost as much as kernel quality. If you cannot tell which loop is hot for which dtype pair, you cannot reason about why a given ISA extension helps.

Mixed-quant runtime path from FP32 hidden state to Q8_K activation blocks into Q4_K or Q6_K kernels.

The mixed path also explains why bugs at the quantization boundary are so dangerous. A mismatch in signed-max selection or block-sum semantics can survive casual generation tests while slowly drifting every downstream mixed-quant matvec.

Dequantization in Registers

The most important optimization in CKE’s quantized kernels is not any one instruction. It is the policy: dequantize into registers, use immediately in multiply-accumulate, never spill intermediate FP32 weights to memory. That policy is explicitly stated in dequant_kernels.c and materialized in the AVX and AVX-512 kernels.

This matters because a naive implementation would decode 4-bit or 6-bit weights into a temporary FP32 buffer, write that buffer to RAM, reload it for the dot product, and then finally compute the result. That destroys much of the bandwidth win that quantization created in the first place.

Design note from src/kernels/dequant_kernels.c

/*
 * Key optimization: Dequantize into registers, use immediately in FMA,
 * never write intermediate FP32 values to memory.
 */

gemv_q4_k_avx512() from src/kernels/gemm_kernels_q4k.c

void gemv_q4_k_avx512(float *y,
                      const void *W,
                      const float *x,
                      int M, int K)
{
    const block_q4_K *blocks = (const block_q4_K *)W;
    const int blocks_per_row = K / QK_K;

    for (int row = 0; row < M; row++) {
        __m512 acc = _mm512_setzero_ps();

        for (int b = 0; b < blocks_per_row; b++) {
            const block_q4_K *block = &blocks[row * blocks_per_row + b];
            const float d = GGML_FP16_TO_FP32(block->d);
            const float dmin = GGML_FP16_TO_FP32(block->dmin);

            uint8_t sc[8], m_arr[8];
            unpack_q4_k_scales(block->scales, sc, m_arr);

            const __m512i mask_lo = _mm512_set1_epi32(0x0F);

            for (int iter = 0; iter < 4; iter++) {
                const float d1 = d * (float)sc[2*iter];
                const float m1 = dmin * (float)m_arr[2*iter];
                const float d2 = d * (float)sc[2*iter + 1];
                const float m2 = dmin * (float)m_arr[2*iter + 1];

                const __m512 vscale1 = _mm512_set1_ps(d1);
                const __m512 vmin1 = _mm512_set1_ps(m1);
                const __m512 vscale2 = _mm512_set1_ps(d2);
                const __m512 vmin2 = _mm512_set1_ps(m2);

                const uint8_t *qs = &block->qs[iter * 32];
                const float *xp = &x[b * QK_K + iter * 64];

                for (int chunk = 0; chunk < 2; chunk++) {
                    __m128i packed = _mm_loadu_si128((const __m128i *)&qs[chunk * 16]);
                    __m512i bytes = _mm512_cvtepu8_epi32(packed);
                    __m512i lo = _mm512_and_epi32(bytes, mask_lo);
                    __m512 w = _mm512_fnmadd_ps(_mm512_set1_ps(1.0f), vmin1,
                               _mm512_mul_ps(_mm512_cvtepi32_ps(lo), vscale1));
                    __m512 x_vec = _mm512_loadu_ps(&xp[chunk * 16]);
                    acc = _mm512_fmadd_ps(w, x_vec, acc);
                }
            }
        }
    }
}

The anti-pattern CKE avoids

text

naive path:
  unpack q4 -> temporary fp32 buffer
  store temporary fp32 buffer to RAM
  reload fp32 buffer from RAM
  multiply by activation vector
  accumulate

CKE path:
  unpack q4 in registers
  scale in registers
  FMA immediately
  keep accumulator live in SIMD state

This optimization is why quantization and SIMD should be discussed together. Compression gives the core fewer bytes to fetch. In-register dequantization makes sure the runtime does not give that win back by manufacturing unnecessary FP32 traffic. 0 spills The ideal inner loop never materializes a dequantized weight buffer in memory. It consumes decoded values the moment they exist.

For AVX-512 specifically, this also meshes beautifully with the Q4_K block geometry. Sixteen FP32 lanes let the kernel decode and consume 16 weights at a time, then repeat across the low and high nibble halves of the block.

ISA Tiers for Quantized Kernels

One of the strongest signals in CKE is that quantized inference is not implemented in one generic file with a few opportunistic intrinsics. It is stratified by ISA tier. There are scalar fallbacks, SSE or SSSE3 files, AVX files, AVX2 files, AVX-512 files, VNNI-specific files, AMX files, and NEON paths for the formats already covered in the ARM NEON kernel post.

This is exactly what CPU vendors want to review: not just that an optimization exists, but that the codebase has a clear picture of which architecture deserves its own kernel.

ISA tier	Representative files	What changes in the kernel
SSE / SSSE3	`gemm_kernels_q4k_sse.c`, `gemm_kernels_q5_0_sse.c`, `gemm_kernels_q5_0_sse_v2.c`, `gemm_kernels_q6k_sse.c`, `quantize_row_q8_k_sse.c`	Narrow vector width, explicit packing and reduction care, parity-sensitive contracts.
AVX	`gemm_kernels_q4k_avx.c`, `quantize_row_q8_k_avx.c`	256-bit vectors without the full AVX2 / VNNI integer surface.
AVX2	`gemm_kernels_q4k_q8k_avx2.c`, `quantize_row_q8_k_avx2.c`	Better integer throughput and fused integer dot-product building blocks.
AVX-512	`gemm_kernels_q4k.c`, `gemm_kernels_q5_0.c`, `gemm_kernels_q8_0.c`, `quantize_row_q8_k_avx512.c`	16-lane FP32 work, cleaner reductions, denser byte unpack operations.
VNNI	`gemm_kernels_q4k_q8k_vnni.c`	Hardware byte-dot instructions collapse multiple INT8 products into one accumulating op.
AMX	`gemm_kernels_amx.c`	Tile registers and matrix-style INT8/BF16 execution for larger GEMM cases.
ARM NEON	`gemm_kernels_q8_0.c`, `gemm_kernels_q5_0.c`, `gemm_kernels_q6k_q8k.c`	128-bit vectors with widening multiplies and explicit reduction idioms, as shown in the ARM NEON kernel post.

Compile-time quantized dispatch hierarchy

text

AMX
  ↓
AVX-512 VNNI / AVX-512
  ↓
AVX2
  ↓
AVX
  ↓
SSE / SSSE3
  ↓
Scalar reference

ARM64 side:
  NEON where present
  scalar fallback otherwise

Quantized ISA inventory at a glance

text

SSE / SSSE3 tier:
  gemm_kernels_q4k_sse.c
  gemm_kernels_q5_0_sse.c
  gemm_kernels_q5_0_sse_v2.c
  gemm_kernels_q6k_sse.c
  quantize_row_q8_k_sse.c

AVX tier:
  gemm_kernels_q4k_avx.c
  quantize_row_q8_k_avx.c

AVX2 / AVX-512 / VNNI / AMX tier:
  gemm_kernels_q4k_q8k_avx2.c
  quantize_row_q8_k_avx2.c
  quantize_row_q8_k_avx512.c
  gemm_kernels_q4k_q8k_vnni.c
  gemm_kernels_amx.c

NEON Q6_K × Q8_K helper from src/kernels/gemm_kernels_q6k_q8k.c

static float dot_q6_k_q8_k_neon(const block_q6_K *w,
                                const block_q8_K *x,
                                int K)
{
    const int nb = K / QK_K;
    float sumf = 0.0f;

    for (int i = 0; i < nb; ++i) {
        const float d = GGML_FP16_TO_FP32(w[i].d) * x[i].d;
        ...
        int32x4_t acc = vdupq_n_s32(0);
        for (int j = 0; j < QK_K; j += 16) {
            const int8x16_t wv = vld1q_s8(&wvals[j]);
            const int8x16_t sv = vld1q_s8(&svals[j]);
            const int8x16_t xv = vld1q_s8(&q8[j]);
            ...
            acc = vaddq_s32(acc, p0);
            acc = vaddq_s32(acc, p1);
        }
    }
}

Seen from far enough away, the ISA tiers are all spelling the same sentence: decode bytes, apply scales, multiply, accumulate, reduce. What changes is how much width and how many fused operations the hardware offers for each stage. the SIMD deep dive established the x86 ladder. the ARM NEON kernel post showed that ARM is already on the board. This post adds the missing detail: the quantized ladder is concrete at every rung.

ISA coverage matrix for CKE quantized kernels across SSE, AVX, AVX2, AVX-512, VNNI, AMX, and ARM NEON.

VNNI — The INT8 Dot Product Instruction

VNNI is where modern Intel client and server cores start looking purpose-built for quantized inference. The crucial instruction is _mm256_dpbusd_epi32, which performs a byte dot product and accumulates four products into each 32-bit lane. Instead of manually unpacking, widening, and summing every byte pair, the hardware performs a large chunk of the inner-loop work directly.

CKE’s gemm_kernels_q4k_q8k_vnni.c is therefore a very explicit proof of ISA fluency. It names the instruction, arranges the bytes the way VNNI wants them, and uses the bsums side channel from Q8_K to handle the offset terms cleanly.

hsum256_epi32() from src/kernels/gemm_kernels_q4k_q8k_vnni.c

static inline int32_t hsum256_epi32(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    __m128i sum = _mm_add_epi32(lo, hi);
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    return _mm_cvtsi128_si32(sum);
}

dot_q4_k_q8_k_32_vnni() from src/kernels/gemm_kernels_q4k_q8k_vnni.c

static inline int32_t dot_q4_k_q8_k_32_vnni(const uint8_t *q4_packed_32,
                                            const int8_t *q8_32,
                                            int high_nibble) {
    const __m256i packed = _mm256_loadu_si256((const __m256i *)q4_packed_32);
    const __m256i mask4 = _mm256_set1_epi8(0x0F);
    const __m256i q4_bytes = high_nibble
        ? _mm256_and_si256(_mm256_srli_epi16(packed, 4), mask4)
        : _mm256_and_si256(packed, mask4);
    const __m256i q8_bytes = _mm256_loadu_si256((const __m256i *)q8_32);
    __m256i acc = _mm256_setzero_si256();
    acc = _mm256_dpbusd_epi32(acc, q4_bytes, q8_bytes);
    return hsum256_epi32(acc);
}

VNNI block accumulation using bsums and dmin

for (int j = 0, is = 0, q_offset = 0; j < QK_K; j += 64, is += 2, q_offset += 32) {
    const uint8_t *qs = &w->qs[q_offset];
    const int8_t *q8_lo = &x->qs[j];
    const int8_t *q8_hi = &x->qs[j + 32];

    const int32_t sum_lo = dot_q4_k_q8_k_32_vnni(qs, q8_lo, 0);
    const int32_t sum_hi = dot_q4_k_q8_k_32_vnni(qs, q8_hi, 1);
    const int32_t bsum_lo = (int32_t)x->bsums[j / 16] +
                            (int32_t)x->bsums[j / 16 + 1];
    const int32_t bsum_hi = (int32_t)x->bsums[(j + 32) / 16] +
                            (int32_t)x->bsums[(j + 32) / 16 + 1];

    sumf += d * (float)sc[is] * (float)sum_lo;
    sumf -= dmin * (float)m_val[is] * (float)bsum_lo;
    sumf += d * (float)sc[is + 1] * (float)sum_hi;
    sumf -= dmin * (float)m_val[is + 1] * (float)bsum_hi;
}

Two things stand out. First, VNNI lets the inner loop become a sequence of hardware byte-dot operations instead of a home-grown multiply-and-reduce pipeline. Second, the Q8_K block-sum field exists precisely because the kernel wants to pay the offset cost once, not rediscover it expensively on every call. This is the kind of kernel a modern Intel reviewer wants to see: the runtime knows when the ISA offers a byte-dot primitive and restructures the math around it rather than pretending all vector extensions are equivalent.

VNNI does not make correctness easier by itself. CKE still keeps the fast path behind an explicit environment gate because changing accumulation order can move borderline logits. That is another sign of mature engineering: hardware enthusiasm checked by numerical discipline.

AMX — Matrix Tiles for Quantized GEMM

AMX is a different tier of hardware support entirely. Instead of simply widening SIMD registers, Intel adds tile registers and instructions that are much closer to dedicated matrix execution. In CKE, gemm_kernels_amx.c documents the model clearly: tile configuration, tile loads, tile dot products, and explicit runtime detection.

The most important practical nuance is where AMX matters. For single-token decode, quantized GEMV often dominates, and the K-quant formats still make the scalar or VNNI-friendly path the most natural implementation. For larger prefill-style matrix multiplies, however, AMX becomes a far more compelling execution surface.

__tile_config from src/kernels/gemm_kernels_amx.c

typedef struct __tile_config {
    uint8_t palette_id;
    uint8_t start_row;
    uint8_t reserved_0[14];
    uint16_t colsb[16];  /* Columns in bytes for each tile */
    uint8_t rows[16];    /* Rows for each tile */
} __tile_config;

configure_tiles_gemm() from src/kernels/gemm_kernels_amx.c

static void configure_tiles_gemm(int M, int N, int K) {
    __tile_config config = {0};

    config.palette_id = 1;

    int tile_m = (M > AMX_TILE_M) ? AMX_TILE_M : M;
    int tile_k = (K > AMX_TILE_K) ? AMX_TILE_K : K;
    int tile_n = (N > AMX_TILE_N) ? AMX_TILE_N : N;

    config.rows[TILE_A] = tile_m;
    config.colsb[TILE_A] = tile_k;

    config.rows[TILE_B] = tile_k;
    config.colsb[TILE_B] = tile_n * 4;

    config.rows[TILE_C] = tile_m;
    config.colsb[TILE_C] = tile_n * 4;

    _tile_loadconfig(&config);
}

gemm_amx_int8_core() inner loop from src/kernels/gemm_kernels_amx.c

for (int m = 0; m < M; m += AMX_TILE_M) {
    int tile_m = (m + AMX_TILE_M <= M) ? AMX_TILE_M : (M - m);

    for (int n = 0; n < N; n += AMX_TILE_N) {
        int tile_n = (n + AMX_TILE_N <= N) ? AMX_TILE_N : (N - n);

        _tile_zero(TILE_C);

        for (int k = 0; k < K; k += AMX_TILE_K) {
            int tile_k = (k + AMX_TILE_K <= K) ? AMX_TILE_K : (K - k);

            _tile_loadd(TILE_A, A + m * K + k, K);
            _tile_loadd(TILE_B, B + k * N + n, N * 4);
            _tile_dpbssd(TILE_C, TILE_A, TILE_B);
        }

        _tile_stored(TILE_C, C + m * N + n, N * 4);
    }
}

amx_available() from src/kernels/gemm_kernels_amx.c

bool amx_available(void) {
    unsigned int eax, ebx, ecx, edx;

    __asm__ __volatile__(
        "cpuid"
        : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
        : "a"(7), "c"(0)
    );

    bool has_amx_tile = (edx >> 24) & 1;
    bool has_amx_int8 = (edx >> 25) & 1;

    unsigned int xcr0_lo = 0, xcr0_hi = 0;
    __asm__ __volatile__(
        ".byte 0x0f, 0x01, 0xd0"
        : "=a"(xcr0_lo), "=d"(xcr0_hi)
        : "c"(0)
    );
    uint64_t xcr0 = ((uint64_t)xcr0_hi << 32) | xcr0_lo;
    bool os_tile_state_enabled = (xcr0 & 0x60000) == 0x60000;

    return has_amx_tile && has_amx_int8 && os_tile_state_enabled;
}

The AMX file is valuable even where it currently falls back for some K-quant cases. It proves the runtime is architected for matrix extensions rather than only for traditional vector ISAs. The engineering question becomes “which data layouts deserve native AMX paths next?” rather than “can the runtime even see AMX?” 8 tiles AMX exposes a different optimization surface from AVX-512: tile registers, tile loads, and dedicated dot-product instructions for larger GEMM shapes.

In short: VNNI is the elite byte-dot vector path; AMX is the matrix-engine path. A serious CPU inference runtime should understand both. CKE visibly does.

The Q8_K SSE Parity Bug — A War Story

The most revealing engineering stories are often the bugs that slip past casual testing. On 2026-03-09, CKE fixed a subtle Q8_K SSE parity bug in quantize_row_q8_k_sse.c. The root issue was not a dramatic crash. It was a contract mismatch: signed-max selection and bsums behavior no longer exactly matched the llama.cpp-style reference path.

That kind of bug is pernicious because outputs can still look mostly normal. A short text-generation sanity check might pass. But once the model repeatedly crosses a mixed-quant boundary like quantize_row_q8_k -> gemv_q4_k_q8_k, tiny mismatches can accumulate into parity drift.

In this case, the bug only surfaced during Nanbeige model bring-up with strict per-op checks. That is exactly the lesson silicon teams should take seriously: quantization bugs often survive demos and only fail under the discipline of exact or near-exact parity validation.

Signed-max contract in quantize_row_q8_k_sse.c

/* Keep the exact signed-max selection contract from llama.cpp/ref. */
float max = 0.0f;
float amax = 0.0f;
for (int j = 0; j < QK_K; ++j) {
    const float xv = x[j];
    const float ax = fabsf(xv);
    if (ax > amax) {
        amax = ax;
        max = xv;
    }
}

if (amax == 0.0f) {
    y[i].d = 0.0f;
    memset(y[i].qs, 0, sizeof(y[i].qs));
    memset(y[i].bsums, 0, sizeof(y[i].bsums));
    x += QK_K;
    continue;
}

SSE quantization and bsums contract from quantize_row_q8_k_sse.c

const float iscale = -127.0f / max;
...
for (int j = 0; j < QK_K; j += 16) {
    const __m128 x0 = _mm_loadu_ps(x + j + 0);
    const __m128 x1 = _mm_loadu_ps(x + j + 4);
    const __m128 x2 = _mm_loadu_ps(x + j + 8);
    const __m128 x3 = _mm_loadu_ps(x + j + 12);
    ...
    const __m128i q0123 = _mm_packs_epi16(q01, q23);
    _mm_storeu_si128((__m128i *)(y[i].qs + j), q0123);

    int sum = 0;
    for (int ii = 0; ii < 16; ++ii) {
        sum += y[i].qs[j + ii];
    }
    y[i].bsums[j / 16] = (int16_t)sum;
}

y[i].d = 1.0f / iscale;

What the March 2026 fix really restored

text

Bug class:
  wrong signed-max selection
  wrong Q8_K bsums contract

Observed symptom:
  outputs still looked plausible
  mixed-quant parity drift accumulated slowly

Why it finally surfaced:
  Nanbeige bring-up
  strict parity checking
  mixed-quant boundary exercised repeatedly

Reference fix:
  commit 224a4d30

The lesson is not merely “bugs happen.” The lesson is that quantization demands a higher standard of validation than casual end-to-end smoke tests. The output distribution can remain superficially sane while the internal contract is already broken. The parity bug is strong evidence that CKE is operating at the right engineering depth. Only teams living close to the byte contract ever discover bugs this subtle—and only disciplined parity workflows catch them reliably.

Q8_K SSE parity bug story showing how a subtle signed-max and block-sum mismatch created mixed-quant drift until strict bring-up caught it.

That is why CKE’s kernel rules matter: strict parity fallbacks are not there for decoration. They are how the team proves each ISA-specific implementation still preserves the intended contract.

Cache Line Geometry

Quantization is often explained with compression ratios, but cache geometry is the more revealing lens for CPU reviewers. The question is not just how many bits a weight occupies on disk. It is how many useful weights arrive per cache-line fetch once the kernel starts streaming through blocks.

Q4_0 is the easy example: 18 bytes per block means a 64-byte line holds a little more than 3.5 blocks, or about 112 weights worth of payload plus scales. Q4_K is larger and more structured: 144 bytes per super-block, which spans 2.25 cache lines. That is still attractive because one fetch bundle brings in 256 weights and their local metadata together.

This is why co-location matters. If the scales lived in one array and the packed weights lived somewhere else, the bandwidth win would erode immediately under extra cache misses and weaker prefetch behavior.

Cache-line arithmetic for common CKE formats

text

Q4_0:
  18 bytes / block
  64 / 18 ≈ 3.55 blocks / line
  ≈ 112 weights / cache line fetch

Q8_0:
  34 bytes / block
  64 / 34 ≈ 1.88 blocks / line
  ≈ 60 weights / cache line fetch

Q4_K:
  144 bytes / super-block
  144 / 64 = 2.25 cache lines / block
  256 weights arrive with nested metadata

Why metadata must be adjacent to weights

text

good layout:
  [scale | packed weights | next block]

bad layout:
  [all scales elsewhere]
  [all packed weights elsewhere]

Reason:
  the kernel wants one streaming access pattern
  not two unrelated pointer chases

Format	Bytes per block	Weights per block	Cache implication
`Q4_0`	18	32	Excellent density; tiny metadata overhead.
`Q8_0`	34	32	Higher fidelity, lower line density.
`Q4_K`	144	256	Super-block spans multiple lines but amortizes metadata well.
`Q8_K`	292	256	Larger activation block, but tailored to mixed-quant compute rather than pure storage density.

There is also a neat mapping between block sizes and SIMD width. A 32-weight Q4_0 block lines up naturally with AVX-512 as two 16-lane FP32 chunks. A 256-weight K-quant block lines up naturally with loop nests that process 64-value quarters or 16-value sub-groups, depending on the ISA. 16 FP32 lanes AVX-512 turns a 32-weight simple block into two clean passes, which is one reason quantized decode loops often look structurally elegant on wide x86 tiers.

Bump Allocator Integration

CKE’s quantization docs make another systems choice explicit: quantized weights, FP32 activations, and scratch space live in separate bump-allocator regions. That sounds mundane until you realize how many runtime bugs and performance regressions it avoids.

Homogeneous regions improve cache behavior and alignment policy. They also prevent type confusion by construction. A pointer to block_q4_0 or block_q4_K does not share storage with live FP32 activations, and the scratch space used by temporary compute stages does not contaminate the read-only weight region.

Bump allocator layout from the CKE docs

typedef struct {
    uint8_t *weights_q4;   // Region 1: quantized weights
    size_t weights_size;

    float *activations;    // Region 2: FP32 activations
    size_t act_size;

    float *scratch;        // Region 3: temporary workspace
    size_t scratch_size;

    float *dequant_cache;  // Optional hot dequant cache
    size_t cache_size;
} BumpAllocator;

Allocation strategy from the docs

size_t num_blocks = (num_weights + 31) / 32;
size_t q4_0_size = num_blocks * sizeof(block_q4_0);

void *weights = bump_alloc(allocator, q4_0_size, REGION_WEIGHTS);

size_t act_size = batch * seq_len * hidden_dim * sizeof(float);
float *activations = bump_alloc(allocator, act_size, REGION_ACTIVATIONS);

Region model discussed in this post

text

Region 1:
  quantized weights
  read-only after model load

Region 2:
  FP32 activations / hidden states
  read-write during inference

Region 3:
  scratch / temporary compute buffers
  ephemeral per op or per stage

ck_dtype_block_size() from include/ckernel_dtype.h

static inline size_t ck_dtype_block_size(CKDataType dt)
{
    switch (dt) {
    case CK_DT_Q4_0:
    case CK_DT_Q4_1:
    case CK_DT_Q5_0:
    case CK_DT_Q5_1:
    case CK_DT_Q8_0:
        return 32;
    case CK_DT_Q4_K:
    case CK_DT_Q5_K:
    case CK_DT_Q6_K:
    case CK_DT_Q8_K:
        return 256;
    default:
        return 1;
    }
}

ck_dtype_block_bytes() from include/ckernel_dtype.h

static inline size_t ck_dtype_block_bytes(CKDataType dt)
{
    switch (dt) {
    case CK_DT_Q4_0:
        return 18;
    case CK_DT_Q4_1:
        return 20;
    case CK_DT_Q5_0:
        return 22;
    case CK_DT_Q5_1:
        return 24;
    case CK_DT_Q4_K:
        return 144;
    case CK_DT_Q6_K:
        return 210;
    case CK_DT_Q8_0:
        return 34;
    case CK_DT_Q8_K:
        return 292;
    default:
        return 0;
    }
}

ck_dtype_is_quantized() from include/ckernel_dtype.h

static inline int ck_dtype_is_quantized(CKDataType dt)
{
    return dt == CK_DT_Q4_0 || dt == CK_DT_Q4_1 || dt == CK_DT_Q5_0 || dt == CK_DT_Q5_1 ||
           dt == CK_DT_Q5_K || dt == CK_DT_Q4_K || dt == CK_DT_Q6_K || dt == CK_DT_Q8_0 || dt == CK_DT_Q8_K;
}

This allocator design is the quiet partner of the kernel work. Quantized kernels perform best when their memory neighborhoods are predictable. Keeping regions homogeneous helps hardware prefetchers, simplifies alignment reasoning, and makes the dtype dispatch story materially safer. A runtime that cares about byte-level quantization but is sloppy about memory regions is only doing half the job. CKE’s allocator model shows the other half.

This is also where the format structs become operationally useful. Allocation is not “reserve some bytes and hope.” It is num_blocks × sizeof(block_q4_0), num_blocks × sizeof(block_q4_K), and so on. The type system and the storage plan reinforce each other.

Conclusion — 32 Files, 11,652 Lines, 8 Formats

The quantization story in CKE is not a feature flag and not a library dependency papered over with wrapper code. It is a dedicated kernel surface: simple formats, K-quants, dequantizers, activation quantizers, fused quantized normalization and attention code, and dispatch rules that explicitly target the ISA tiers modern CPU vendors care about.

The most important production path today is the mixed one: Q4_K or Q6_K weights multiplied against Q8_K activations produced on the fly from FP32 hidden state. That path explains why Q8_K needs FP32 scale precision, why bsums exist, why VNNI is such a natural match, and why the Q8_K SSE parity fix mattered so much.

The repository depth is the real message. Scalar through AVX-512, VNNI, and AMX on x86. Existing NEON paths on ARM. Register-level dequantization. Byte-level packing. Strict parity fallbacks. And a bug history subtle enough to prove the team is working at the right abstraction layer.

Where to go next

Read the official CKE quantization deep dive, the byte-level format reference, and the broader scaling thesis.

Then inspect the source directly at github.com/antshiv/C-Kernel-Engine, including the March 2026 Q8_K SSE parity fix, and pair this post with the SIMD deep dive and the ARM NEON kernel post for the full ISA-level picture.

Engineering summary

text

CKE quantization story:
  8 headline formats in docs
  mixed-quant runtime path in production
  DType-based dispatch outside the hot loop
  dequantize in registers, not RAM
  ISA tiers from SSE through AVX-512 VNNI and AMX
  ARM NEON coverage already live
  strict parity matters enough to catch subtle Q8_K bugs

Quantization Deep Dive: How CPU Kernels Compress Weights and Preserve Accuracy

What this post covers

Introduction — Why Quantization Is the CPU Story

Block Quantization Fundamentals

The Format Landscape — 8 Formats in CKE

Q4_0 — The Simplest Format

Q4_K — K-Quant with Nested Scales

Q8_K — The Activation-Side Bridge

The Mixed-Quant Runtime Path

Dequantization in Registers

ISA Tiers for Quantized Kernels

VNNI — The INT8 Dot Product Instruction

AMX — Matrix Tiles for Quantized GEMM

The Q8_K SSE Parity Bug — A War Story

Cache Line Geometry

Bump Allocator Integration

Conclusion — 32 Files, 11,652 Lines, 8 Formats

Where to go next

ShivasNotes

Explore

Connect

Quantization Deep Dive: How CPU Kernels Compress Weights and Preserve Accuracy

What this post covers

Introduction — Why Quantization Is the CPU Story

Block Quantization Fundamentals

The Format Landscape — 8 Formats in CKE

Q4_0 — The Simplest Format

Q4_K — K-Quant with Nested Scales

Q8_K — The Activation-Side Bridge

The Mixed-Quant Runtime Path

Dequantization in Registers

ISA Tiers for Quantized Kernels

VNNI — The INT8 Dot Product Instruction

AMX — Matrix Tiles for Quantized GEMM

The Q8_K SSE Parity Bug — A War Story

Cache Line Geometry

Bump Allocator Integration

Conclusion — 32 Files, 11,652 Lines, 8 Formats

Where to go next

Subscribe

Subscribe to emails from Anthony

ShivasNotes

Explore

Connect