Quantization deep dive · ML fundamentals
This ShivasNotes deep dive is written for CPU silicon teams evaluating whether C-Kernel-Engine really owns its quantized inference path. The answer is yes: quantization in C-Kernel-Engine is handwritten kernel code, documented at the quantization deep dive and the format reference, with the walkthrough companion on youtube.com/@antshivrobotics@shivasnotes.
The thesis of this post is simple: quantization is the CPU story. If inference is memory-bound, then the decisive engineering question is not just “how many FLOPs can the core issue?” but “how aggressively can the runtime compress weights, preserve accuracy, and keep dequantized values inside registers instead of memory?” For a silicon team, this is the tell: CKE does not wave at quantization as a checkbox. It names the block format, the byte packing, the dispatch rule, the ISA tier, and even the parity bug that only showed up under real model bring-up.
What this post covers
First, the post frames quantization as the practical way CPUs turn large models from impossible to deployable. Then it walks the actual CKE formats and the mixed-quant runtime path that matters in production.
The second half moves from formats to engineering reality: dequantization in registers, ISA-specific kernels from SSE through AVX-512 VNNI and AMX, the March 2026 Q8_K SSE parity fix, cache-line geometry, allocator integration, and what these choices signal to ARM, Intel, and AMD.
Introduction — Why Quantization Is the CPU Story
AI inference on CPU is usually memory-bound long before it is arithmetic-bound. At batch size 1, decode looks like repeated matrix-vector multiplies with low arithmetic intensity. That means the system wins when it moves fewer bytes, not when it merely advertises higher peak FLOPs.
That is why quantization is central rather than optional. A 7B model in FP32 weighs roughly 28 GB. Put the same model into a Q4_K-style footprint and the memory story drops to roughly 3.5 GB. That is the difference between “does not fit in the practical deployment box” and “runs in one coherent CPU memory space.”
CKE’s quantization story is therefore not “we call a vendor library.” It is a kernel surface: dedicated per-format GEMV and GEMM files, dequantization helpers, activation quantizers, fused quantized attention pieces, and ISA-tiered implementations that exist because the runtime owns the layout contract end to end.
That ownership matters for silicon review. When a runtime fully owns quantization, the evaluator can see exactly where SSE ends, where AVX2 begins, where VNNI changes the inner loop, where AMX becomes relevant, and where correctness is protected by strict parity fallbacks. 32 files The public framing for this post is roughly 32 quantization-focused kernel files and 11.6k+ lines of handwritten C across scalar, SSE, AVX, AVX2, AVX-512, VNNI, AMX, and ARM NEON paths.

7B parameters:
FP32 -> 7e9 × 4 bytes ≈ 28 GB
INT8 -> 7e9 × 1 byte ≈ 7 GB
INT4 -> 7e9 × 0.5 byte ≈ 3.5 GB
Decode reality:
arithmetic intensity is low
memory traffic dominates
fewer bytes moved = faster practical inferenceFormat kernels:
q4_0, q4_1, q5_0, q5_1, q5_k, q6_k, q8_0, q4_k, q8_k
Support kernels:
dequant_kernels.c
quantize_row_q8_k_*.c
rmsnorm_kernels_int4.c
rmsnorm_kernels_int8.c
ISA-specialized paths:
SSE / SSSE3
AVX
AVX2
AVX-512
AVX-512 VNNI
AMX
ARM NEONPer-format kernels:
gemm_kernels_q4_0.c
gemm_kernels_q4_1.c
gemm_kernels_q5_0.c
gemm_kernels_q5_1.c
gemm_kernels_q5_k.c
gemm_kernels_q6k.c
gemm_kernels_q8_0.c
gemm_kernels_q4k.c
gemm_kernels_q4k_q8k.c
gemm_kernels_q6k_q8k.c
Dispatch helpers and specialization files:
gemm_kernels_q4k_sse.c
gemm_kernels_q4k_avx.c
gemm_kernels_q4k_q8k_avx2.c
gemm_kernels_q4k_q8k_vnni.c
gemm_kernels_amx.c
quantize_row_q8_k_sse.c
quantize_row_q8_k_avx.c
quantize_row_q8_k_avx2.c
quantize_row_q8_k_avx512.c352 gemm_kernels_q4_0.c
304 gemm_kernels_q4_1.c
1753 gemm_kernels_q5_0.c
369 gemm_kernels_q5_1.c
687 gemm_kernels_q5_k.c
736 gemm_kernels_q4k.c
400 gemm_kernels_q4k_q8k.c
1228 gemm_kernels_q6k_q8k.c
275 gemm_kernels_amx.c
100 quantize_row_q8_k_sse.c
542 dequant_kernels.c
140 rmsnorm_kernels_int4.c
123 rmsnorm_kernels_int8.cThe rest of this post treats quantization as a systems problem rather than a compression trick. The key questions are: how are bytes packed, how are scales represented, how does dispatch select the correct hot loop, and what kinds of bugs only show up when strict per-op parity is enforced.
Block Quantization Fundamentals
Single-scale quantization sounds simple: find the maximum absolute value, map the whole tensor into an integer range, and dequantize later by multiplying with one scale. For neural weights, that often fails because one large outlier forces the step size high enough that the small but important values collapse to zero.
Block quantization fixes that by grouping weights into small blocks and giving each block its own scale. In CKE’s simple formats that usually means 32 weights per block. In the K-quants it means 256-weight super-blocks with nested sub-scales.
This also explains the split between symmetric and asymmetric formats. Symmetric quantization assumes values cluster around zero and stores only a scale. Asymmetric quantization stores both a scale and an offset or minimum, which makes it more expensive in bytes but more accurate for distributions that are not centered.
The engineering trade is always the same: smaller groups improve local fidelity, but every group needs metadata. That metadata is why block layout is as important as bit width. Compression only helps if the scale overhead stays small and the kernel can decode it cheaply. A good quantization format does two things at once: it protects the model from catastrophic information loss and it stays friendly to the CPU’s load, unpack, multiply, and reduce instructions.
| Concept | What it means | Why it matters in kernels |
|---|---|---|
| Single scale | One scale for a large tensor or row. | Cheap metadata, but small weights often quantize to zero. |
| Per-block scale | One scale for each 32-weight or 256-weight block. | Preserves local detail with manageable overhead. |
| Symmetric | Centered around zero, store scale only. | Simple decode, ideal for Q4_0 / Q5_0 / Q8_0 style paths. |
| Asymmetric | Store scale and offset / minimum. | Better fit for skewed ranges, but decode has an extra term. |
| K-quant | Super-block plus nested sub-scales. | Higher compression efficiency without per-32 FP16 overhead. |
Quantize:
scale = max(abs(weights)) / (2^(bits-1) - 1)
q = round(weight / scale)
Dequantize:
weight ≈ q × scale
Example intuition:
INT4 symmetric range ≈ [-8, 7]
INT8 symmetric range ≈ [-127, 127]Quantize:
scale = (max - min) / (2^bits - 1)
q = round((weight - min) / scale)
Dequantize:
weight ≈ q × scale + min
CKE examples:
Q4_1 -> q × d + m
Q5_1 -> q × d + mSingle-scale failure:
large outlier raises scale
tiny weights round to 0
Per-block grouping:
each block gets local scale
tiny weights can still occupy multiple integer levels
metadata cost stays bounded
the SIMD deep dive framed the SIMD question as “same math, wider registers, fewer instructions.” Quantization adds a new layer: same math, fewer bytes, more careful unpacking. the ARM NEON kernel post showed the ARM version of this same story. Here the focus is the format contract itself.
The Format Landscape — 8 Formats in CKE
The public CKE quantization docs currently center eight formats: five simple 32-weight formats and three 256-weight K-quant formats. Those are the formats a silicon reviewer should understand first because they define the on-disk layout, the dequant math, and the runtime dispatch surface.
There is one subtle repository detail worth calling out. The kernel tree also contains gemm_kernels_q5_k.c. In other words, the compute surface already services an additional K-quant tier beyond the eight-format headline table. The point of this post, though, is the documented core set and the mixed-quant path that drives production inference.
| Format | Bits / weight | Block size | Structure | Typical role |
|---|---|---|---|---|
Q4_0 | 4.5 | 32 | 2-byte FP16 scale + 16 packed bytes | Simplest symmetric INT4 weight path. |
Q4_1 | 5.0 | 32 | FP16 scale + FP16 min + 16 packed bytes | Asymmetric INT4 weights. |
Q5_0 | 5.5 | 32 | FP16 scale + 4-byte high-bit field + 16 packed bytes | 5-bit symmetric weights. |
Q5_1 | 6.0 | 32 | Scale + min + high-bit field + packed bytes | 5-bit asymmetric weights. |
Q8_0 | 8.5 | 32 | FP16 scale + 32 signed bytes | High-fidelity simple quant format. |
Q4_K | 4.5 | 256 | FP16 d, FP16 dmin, 12-byte packed scales, 128 packed bytes | Primary compression-first K-quant weight format. |
Q6_K | 6.5625 | 256 | 128-byte low bits + 64-byte high bits + 16 int8 scales + FP16 | Higher-fidelity K-quant weights. |
Q8_K | 9.125 | 256 | FP32 scale + 256 int8 values + 16 block sums | Activation-side bridge for mixed K-quant kernels. |
block_q4_0 from include/ckernel_quant.h#define QK4_0 32
typedef struct {
ck_half d; /* 2 bytes: scale (delta) */
uint8_t qs[QK4_0 / 2]; /* 16 bytes: 32 x 4-bit weights (2 per byte) */
} block_q4_0;
/* Total: 18 bytes per 32 weights */block_q8_0 from include/ckernel_quant.h#define QK8_0 32
typedef struct {
ck_half d; /* 2 bytes: scale */
int8_t qs[QK8_0]; /* 32 bytes: 32 x 8-bit signed weights */
} block_q8_0;
/* Total: 34 bytes per 32 weights */block_q5_0 from include/ckernel_quant.h#define QK5_0 32
typedef struct {
ck_half d; /* 2 bytes: scale (delta) */
uint8_t qh[4]; /* 4 bytes: high 1-bit of each weight (32 bits total) */
uint8_t qs[QK5_0 / 2]; /* 16 bytes: low 4-bits of 32 weights (2 per byte) */
} block_q5_0;
/* Total: 22 bytes per 32 weights */block_q5_1 from include/ckernel_quant.h#define QK5_1 32
typedef struct {
ck_half d; /* 2 bytes: scale (delta) */
ck_half m; /* 2 bytes: minimum */
uint8_t qh[4]; /* 4 bytes: high 1-bit of each weight (32 bits total) */
uint8_t qs[QK5_1 / 2]; /* 16 bytes: low 4-bits of 32 weights (2 per byte) */
} block_q5_1;
/* Total: 24 bytes per 32 weights */block_q6_K from include/ckernel_quant.htypedef struct {
uint8_t ql[QK_K / 2]; /* 128 bytes: low 4 bits */
uint8_t qh[QK_K / 4]; /* 64 bytes: high 2 bits */
int8_t scales[QK_K / 16]; /* 16 bytes: 16 sub-block scales */
ck_half d; /* 2 bytes: super-block scale */
} block_q6_K;
/* Total: 210 bytes per 256 weights */block_q4_K from include/ckernel_quant.h#define QK_K 256
#define K_SCALE_SIZE 12
typedef struct {
ck_half d; /* 2 bytes: super-block scale */
ck_half dmin; /* 2 bytes: super-block minimum */
uint8_t scales[K_SCALE_SIZE]; /* 12 bytes: 8 sub-block scales + 8 sub-block mins (6-bit packed) */
uint8_t qs[QK_K / 2]; /* 128 bytes: 256 x 4-bit weights */
} block_q4_K;
/* Total: 144 bytes per 256 weights */block_q8_K from include/ckernel_quant.htypedef struct {
float d; /* 4 bytes: scale */
int8_t qs[QK_K]; /* 256 bytes: 256 x 8-bit signed weights */
int16_t bsums[QK_K / 16]; /* 32 bytes: block sums for optimization */
} block_q8_K;
/* Total: 292 bytes per 256 weights */dequant_q5_0_block() from src/kernels/dequant_kernels.cvoid dequant_q5_0_block(const block_q5_0 *block, float *output)
{
const float d = GGML_FP16_TO_FP32(block->d);
uint32_t qh;
memcpy(&qh, block->qh, sizeof(qh));
for (int j = 0; j < QK5_0 / 2; j++) {
const uint8_t packed = block->qs[j];
const int lo = (packed & 0x0F);
const int hi = (packed >> 4);
const int xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
const int xh_1 = ((qh >> (j + 12))) & 0x10;
const int q0 = (lo | xh_0) - 16;
const int q1 = (hi | xh_1) - 16;
output[j] = d * (float)q0;
output[j + 16] = d * (float)q1;
}
}dequant_q6_k_block() from src/kernels/dequant_kernels.cvoid dequant_q6_k_block(const block_q6_K *block, float *output)
{
const float d = GGML_FP16_TO_FP32(block->d);
const uint8_t *ql = block->ql;
const uint8_t *qh = block->qh;
const int8_t *sc = block->scales;
float *y = output;
for (int n = 0; n < QK_K; n += 128) {
for (int l = 0; l < 32; ++l) {
const int is = l / 16;
const int8_t q1 = (int8_t)((ql[l + 0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
const int8_t q2 = (int8_t)((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
const int8_t q3 = (int8_t)((ql[l + 0] >> 4) | (((qh[l] >> 4) & 3) << 4)) - 32;
const int8_t q4 = (int8_t)((ql[l + 32] >> 4) | (((qh[l] >> 6) & 3) << 4)) - 32;
y[l + 0] = d * (float)sc[is + 0] * (float)q1;
y[l + 32] = d * (float)sc[is + 2] * (float)q2;
y[l + 64] = d * (float)sc[is + 4] * (float)q3;
y[l + 96] = d * (float)sc[is + 6] * (float)q4;
}
y += 128;
ql += 64;
qh += 32;
sc += 8;
}
} The most important format-level distinction is that Q8_K uses an FP32 scale, not FP16. That is not cosmetic. It tells you CKE treats the activation bridge with extra numerical care because mixed-quant boundaries amplify small contract mismatches. 292 B Q8_K spends more bytes than a simple 8-bit block because it is not just a storage format; it is a compute contract for the mixed K-quant path.

Q4_0 — The Simplest Format
Q4_0 is the cleanest starting point because it makes the packing logic obvious. One FP16 scale. Sixteen data bytes. Two 4-bit weights per byte. Each nibble becomes a signed integer in the range [-8, 7], multiplied by the block scale.
The key observation is that the physical byte order is not the logical element order a newcomer might assume. In CKE’s dequantizer, the lower nibble maps to one half of the block and the upper nibble maps to the other half. That choice keeps the layout compatible with GGML-style blocks and tells the kernel exactly how to unpack.
dequant_q4_0_block() from src/kernels/dequant_kernels.cvoid dequant_q4_0_block(const block_q4_0 *block, float *output)
{
const float d = GGML_FP16_TO_FP32(block->d);
for (int i = 0; i < QK4_0 / 2; i++) {
const uint8_t packed = block->qs[i];
/* Lower nibble: elements 0..15 */
const int8_t q0 = (packed & 0x0F) - 8;
/* Upper nibble: elements 16..31 */
const int8_t q1 = (packed >> 4) - 8;
output[i] = d * (float)q0;
output[i + QK4_0 / 2] = d * (float)q1;
}
}block_q4_0 = 18 bytes total
bytes 0..1 : d (FP16 scale)
bytes 2..17 : qs[16] (32 weights packed as 4-bit nibbles)
Each qs[i] byte:
low nibble -> output[i]
high nibble -> output[i + 16]Q4_0 block size = 18 bytes
64-byte cache line / 18 bytes ≈ 3.55 blocks
3.5 blocks × 32 weights ≈ 112 weights per line fetch
Interpretation:
tiny metadata overhead
dense weight packing
strong bandwidth behavior for decode-style accessThis is exactly why 4-bit quantization is so compelling on CPU. Even before any SIMD enters the frame, the memory subsystem gets a radically smaller working set. Then the kernel engineer’s task becomes unpacking those nibbles with minimal instruction overhead. 18 bytes That is the entire storage cost for 32 weights, including the scale metadata. The bandwidth story is already visible before a single FMA executes.

the SIMD deep dive focused on how wider registers reduce the instruction count for the same scalar math. Q4_0 adds a new lesson: quantized kernels begin with representation engineering. Before the vector multiply, the runtime has to know exactly which nibble means which weight.
Q4_K — K-Quant with Nested Scales
Q4_K is where quantization stops looking elementary and starts looking like serious runtime engineering. A single 256-weight super-block carries two FP16 values, a 12-byte packed metadata field, and 128 bytes of 4-bit data. The metadata does not merely hold one scale. It holds the ingredients for eight sub-block scales and eight sub-block minima.
That extra complexity exists for a reason. The K-quant idea is to preserve local fidelity without paying the metadata cost of one FP16 scale per 32-weight block. It compresses better than the simple formats while still giving the kernel enough local information to reconstruct usable floating-point values.
The common bug source is the sign on the minimum term. In Q4_K, the correct formula is weight = q × (d × sc) - (dmin × mn). The minus is not negotiable. Treating dmin like an additive bias yields numerically wrong outputs that can look superficially plausible until strict parity catches them.
unpack_q4_k_scales() from include/ckernel_quant.hstatic inline void unpack_q4_k_scales(const uint8_t *scales,
uint8_t *sc, uint8_t *m) {
sc[0] = scales[0] & 0x3F;
sc[1] = scales[1] & 0x3F;
sc[2] = scales[2] & 0x3F;
sc[3] = scales[3] & 0x3F;
m[0] = scales[4] & 0x3F;
m[1] = scales[5] & 0x3F;
m[2] = scales[6] & 0x3F;
m[3] = scales[7] & 0x3F;
sc[4] = (scales[8] & 0x0F) | ((scales[0] >> 6) << 4);
sc[5] = (scales[9] & 0x0F) | ((scales[1] >> 6) << 4);
sc[6] = (scales[10] & 0x0F) | ((scales[2] >> 6) << 4);
sc[7] = (scales[11] & 0x0F) | ((scales[3] >> 6) << 4);
m[4] = (scales[8] >> 4) | ((scales[4] >> 6) << 4);
m[5] = (scales[9] >> 4) | ((scales[5] >> 6) << 4);
m[6] = (scales[10] >> 4) | ((scales[6] >> 6) << 4);
m[7] = (scales[11] >> 4) | ((scales[7] >> 6) << 4);
}dequant_q4_k_block() from src/kernels/dequant_kernels.cvoid dequant_q4_k_block(const block_q4_K *block, float *output)
{
const float d = GGML_FP16_TO_FP32(block->d);
const float dmin = GGML_FP16_TO_FP32(block->dmin);
uint8_t sc[8], m[8];
unpack_q4_k_scales(block->scales, sc, m);
for (int iter = 0; iter < 4; iter++) {
const float d1 = d * (float)sc[2 * iter];
const float m1 = dmin * (float)m[2 * iter];
const float d2 = d * (float)sc[2 * iter + 1];
const float m2 = dmin * (float)m[2 * iter + 1];
const uint8_t *qs = &block->qs[iter * 32];
float *out = &output[iter * 64];
for (int l = 0; l < 32; l++) {
const int q = (qs[l] & 0x0F);
out[l] = d1 * (float)q - m1;
}
for (int l = 0; l < 32; l++) {
const int q = (qs[l] >> 4);
out[32 + l] = d2 * (float)q - m2;
}
}
}256-weight super-block
d : FP16 super-scale
dmin : FP16 super-min scale
scales : packed 6-bit sub-scale + sub-min fields
qs : 128 bytes of packed 4-bit values
Decode rule per sub-block:
weight = q × (d × sc) - (dmin × mn) This is the point where quantization becomes a real kernel-design problem. The decode is no longer just “extract nibble, subtract center, multiply by scale.” It is metadata unpacking, sub-block bookkeeping, and then an expression whose sign matters enough to destroy parity if implemented loosely. If you remember only one formula from this section, remember the minus sign. Q4_K subtracts dmin × mn. Adding it is a classic wrong-output trap.

K-quants exist because the simple formats leave quality on the table at the same nominal bit width. The engineering win is not just size reduction. It is better compression without surrendering the local dynamic range that modern LLM layers need.
Q8_K — The Activation-Side Bridge
The most common mistake in reading a quantized runtime is assuming every format is just “another way to store weights.” Q8_K is more important than that. In CKE’s current mixed-quant path, it is the activation-side bridge between FP32 hidden state and K-quant weight kernels.
That explains two unusual decisions. First, the scale is FP32, not FP16. Second, the struct carries bsums: precomputed sums of each 16-value chunk. Those block sums are not decorative metadata; they exist so kernels such as the VNNI path can cheaply handle offset terms without rescanning the activation bytes.
block_q8_K and its intent from include/ckernel_quant.h/* Q8_K: K-Quant 8-bit (used for activations in some ops)
* - 256 weights per super-block
* - 1 FP32 scale per block (not FP16 like others!)
*/
typedef struct {
float d; /* 4 bytes: scale */
int8_t qs[QK_K]; /* 256 bytes: 256 x 8-bit signed weights */
int16_t bsums[QK_K / 16]; /* 32 bytes: block sums for optimization */
} block_q8_K;quantize_row_q8_k_sse() block-sum writeback from src/kernels/quantize_row_q8_k_sse.cfor (int j = 0; j < QK_K; j += 16) {
...
_mm_storeu_si128((__m128i *)(y[i].qs + j), q0123);
int sum = 0;
for (int ii = 0; ii < 16; ++ii) {
sum += y[i].qs[j + ii];
}
y[i].bsums[j / 16] = (int16_t)sum;
}
y[i].d = 1.0f / iscale;FP32 hidden state
-> quantize_row_q8_k
-> Q8_K activation block
-> gemv_q4_k_q8_k or gemv_q6_k_q8_k
Q8_K is not merely "8-bit weights".
It is the activation contract that mixed K-quant kernels expect. This is the format that makes the modern runtime path possible. The weights remain compressed in Q4_K or Q6_K, while each token’s FP32 hidden state is quantized on the fly into a layout whose scale precision and block-sum side channel are tailored to the downstream kernel. FP32 scale That extra precision is one reason the mixed-quant boundary can stay numerically well-behaved—until a parity bug breaks the contract, as the Q8_K parity section shows.
If the ARM NEON kernel post showed that ARM NEON already participates in quantized inference, Q8_K explains why the mixed path is portable across ISA tiers. The activation contract is fixed; only the way each architecture consumes it changes.
The Mixed-Quant Runtime Path
The production runtime story is not “everything is statically quantized forever.” Instead, CKE holds the hidden state in FP32, quantizes that activation row into Q8_K, and then dispatches into the matching K-quant kernel for the weights. This is the modern mixed-quant path that matters for v7-style inference.
The central systems idea is separation of concerns. Tensors know their DType. Dispatch resolves the correct kernel before the hot loop begins. The kernel then runs against homogeneous, already-understood layouts instead of wasting cycles on type checks inside the inner loop.
typedef enum {
DTYPE_FP32,
DTYPE_FP16,
DTYPE_BF16,
DTYPE_Q4_0,
DTYPE_Q4_1,
DTYPE_Q5_0,
DTYPE_Q5_1,
DTYPE_Q4_K,
DTYPE_Q6_K,
DTYPE_Q8_0,
DTYPE_Q8_K
} DType;select_linear_kernel() from the CKE quantization docsconst char *select_linear_kernel(DType weight, DType activation, bool prefill) {
if (!prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemv_q4_k_q8_k";
if (!prefill && weight == DTYPE_Q6_K && activation == DTYPE_Q8_K) return "gemv_q6_k_q8_k";
if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemm_nt_q4_k_q8_k";
if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_FP32) return "gemm_nt_q4_k";
if (!prefill && weight == DTYPE_Q8_0 && activation == DTYPE_FP32) return "gemv_q8_0";
return "fallback_or_error";
}void lower_mixed_quant_linear(...) {
quantize_row_q8_k(hidden_fp32, hidden_q8k, K);
gemv_q4_k_q8_k(out_fp32, weight_q4k, hidden_q8k, K);
}The important phrase here is type checking at tensor level, never in the hot path. This is the same software discipline good compiler backends follow: decide the representation early, then let the inner loop focus exclusively on arithmetic and memory movement. For CPU vendors, dispatch clarity matters almost as much as kernel quality. If you cannot tell which loop is hot for which dtype pair, you cannot reason about why a given ISA extension helps.

The mixed path also explains why bugs at the quantization boundary are so dangerous. A mismatch in signed-max selection or block-sum semantics can survive casual generation tests while slowly drifting every downstream mixed-quant matvec.
Dequantization in Registers
The most important optimization in CKE’s quantized kernels is not any one instruction. It is the policy: dequantize into registers, use immediately in multiply-accumulate, never spill intermediate FP32 weights to memory. That policy is explicitly stated in dequant_kernels.c and materialized in the AVX and AVX-512 kernels.
This matters because a naive implementation would decode 4-bit or 6-bit weights into a temporary FP32 buffer, write that buffer to RAM, reload it for the dot product, and then finally compute the result. That destroys much of the bandwidth win that quantization created in the first place.
src/kernels/dequant_kernels.c/*
* Key optimization: Dequantize into registers, use immediately in FMA,
* never write intermediate FP32 values to memory.
*/gemv_q4_k_avx512() from src/kernels/gemm_kernels_q4k.cvoid gemv_q4_k_avx512(float *y,
const void *W,
const float *x,
int M, int K)
{
const block_q4_K *blocks = (const block_q4_K *)W;
const int blocks_per_row = K / QK_K;
for (int row = 0; row < M; row++) {
__m512 acc = _mm512_setzero_ps();
for (int b = 0; b < blocks_per_row; b++) {
const block_q4_K *block = &blocks[row * blocks_per_row + b];
const float d = GGML_FP16_TO_FP32(block->d);
const float dmin = GGML_FP16_TO_FP32(block->dmin);
uint8_t sc[8], m_arr[8];
unpack_q4_k_scales(block->scales, sc, m_arr);
const __m512i mask_lo = _mm512_set1_epi32(0x0F);
for (int iter = 0; iter < 4; iter++) {
const float d1 = d * (float)sc[2*iter];
const float m1 = dmin * (float)m_arr[2*iter];
const float d2 = d * (float)sc[2*iter + 1];
const float m2 = dmin * (float)m_arr[2*iter + 1];
const __m512 vscale1 = _mm512_set1_ps(d1);
const __m512 vmin1 = _mm512_set1_ps(m1);
const __m512 vscale2 = _mm512_set1_ps(d2);
const __m512 vmin2 = _mm512_set1_ps(m2);
const uint8_t *qs = &block->qs[iter * 32];
const float *xp = &x[b * QK_K + iter * 64];
for (int chunk = 0; chunk < 2; chunk++) {
__m128i packed = _mm_loadu_si128((const __m128i *)&qs[chunk * 16]);
__m512i bytes = _mm512_cvtepu8_epi32(packed);
__m512i lo = _mm512_and_epi32(bytes, mask_lo);
__m512 w = _mm512_fnmadd_ps(_mm512_set1_ps(1.0f), vmin1,
_mm512_mul_ps(_mm512_cvtepi32_ps(lo), vscale1));
__m512 x_vec = _mm512_loadu_ps(&xp[chunk * 16]);
acc = _mm512_fmadd_ps(w, x_vec, acc);
}
}
}
}
}naive path:
unpack q4 -> temporary fp32 buffer
store temporary fp32 buffer to RAM
reload fp32 buffer from RAM
multiply by activation vector
accumulate
CKE path:
unpack q4 in registers
scale in registers
FMA immediately
keep accumulator live in SIMD stateThis optimization is why quantization and SIMD should be discussed together. Compression gives the core fewer bytes to fetch. In-register dequantization makes sure the runtime does not give that win back by manufacturing unnecessary FP32 traffic. 0 spills The ideal inner loop never materializes a dequantized weight buffer in memory. It consumes decoded values the moment they exist.
For AVX-512 specifically, this also meshes beautifully with the Q4_K block geometry. Sixteen FP32 lanes let the kernel decode and consume 16 weights at a time, then repeat across the low and high nibble halves of the block.
ISA Tiers for Quantized Kernels
One of the strongest signals in CKE is that quantized inference is not implemented in one generic file with a few opportunistic intrinsics. It is stratified by ISA tier. There are scalar fallbacks, SSE or SSSE3 files, AVX files, AVX2 files, AVX-512 files, VNNI-specific files, AMX files, and NEON paths for the formats already covered in the ARM NEON kernel post.
This is exactly what CPU vendors want to review: not just that an optimization exists, but that the codebase has a clear picture of which architecture deserves its own kernel.
| ISA tier | Representative files | What changes in the kernel |
|---|---|---|
| SSE / SSSE3 | gemm_kernels_q4k_sse.c, gemm_kernels_q5_0_sse.c, gemm_kernels_q5_0_sse_v2.c, gemm_kernels_q6k_sse.c, quantize_row_q8_k_sse.c | Narrow vector width, explicit packing and reduction care, parity-sensitive contracts. |
| AVX | gemm_kernels_q4k_avx.c, quantize_row_q8_k_avx.c | 256-bit vectors without the full AVX2 / VNNI integer surface. |
| AVX2 | gemm_kernels_q4k_q8k_avx2.c, quantize_row_q8_k_avx2.c | Better integer throughput and fused integer dot-product building blocks. |
| AVX-512 | gemm_kernels_q4k.c, gemm_kernels_q5_0.c, gemm_kernels_q8_0.c, quantize_row_q8_k_avx512.c | 16-lane FP32 work, cleaner reductions, denser byte unpack operations. |
| VNNI | gemm_kernels_q4k_q8k_vnni.c | Hardware byte-dot instructions collapse multiple INT8 products into one accumulating op. |
| AMX | gemm_kernels_amx.c | Tile registers and matrix-style INT8/BF16 execution for larger GEMM cases. |
| ARM NEON | gemm_kernels_q8_0.c, gemm_kernels_q5_0.c, gemm_kernels_q6k_q8k.c | 128-bit vectors with widening multiplies and explicit reduction idioms, as shown in the ARM NEON kernel post. |
AMX
↓
AVX-512 VNNI / AVX-512
↓
AVX2
↓
AVX
↓
SSE / SSSE3
↓
Scalar reference
ARM64 side:
NEON where present
scalar fallback otherwiseSSE / SSSE3 tier:
gemm_kernels_q4k_sse.c
gemm_kernels_q5_0_sse.c
gemm_kernels_q5_0_sse_v2.c
gemm_kernels_q6k_sse.c
quantize_row_q8_k_sse.c
AVX tier:
gemm_kernels_q4k_avx.c
quantize_row_q8_k_avx.c
AVX2 / AVX-512 / VNNI / AMX tier:
gemm_kernels_q4k_q8k_avx2.c
quantize_row_q8_k_avx2.c
quantize_row_q8_k_avx512.c
gemm_kernels_q4k_q8k_vnni.c
gemm_kernels_amx.csrc/kernels/gemm_kernels_q6k_q8k.cstatic float dot_q6_k_q8_k_neon(const block_q6_K *w,
const block_q8_K *x,
int K)
{
const int nb = K / QK_K;
float sumf = 0.0f;
for (int i = 0; i < nb; ++i) {
const float d = GGML_FP16_TO_FP32(w[i].d) * x[i].d;
...
int32x4_t acc = vdupq_n_s32(0);
for (int j = 0; j < QK_K; j += 16) {
const int8x16_t wv = vld1q_s8(&wvals[j]);
const int8x16_t sv = vld1q_s8(&svals[j]);
const int8x16_t xv = vld1q_s8(&q8[j]);
...
acc = vaddq_s32(acc, p0);
acc = vaddq_s32(acc, p1);
}
}
}Seen from far enough away, the ISA tiers are all spelling the same sentence: decode bytes, apply scales, multiply, accumulate, reduce. What changes is how much width and how many fused operations the hardware offers for each stage. the SIMD deep dive established the x86 ladder. the ARM NEON kernel post showed that ARM is already on the board. This post adds the missing detail: the quantized ladder is concrete at every rung.

VNNI — The INT8 Dot Product Instruction
VNNI is where modern Intel client and server cores start looking purpose-built for quantized inference. The crucial instruction is _mm256_dpbusd_epi32, which performs a byte dot product and accumulates four products into each 32-bit lane. Instead of manually unpacking, widening, and summing every byte pair, the hardware performs a large chunk of the inner-loop work directly.
CKE’s gemm_kernels_q4k_q8k_vnni.c is therefore a very explicit proof of ISA fluency. It names the instruction, arranges the bytes the way VNNI wants them, and uses the bsums side channel from Q8_K to handle the offset terms cleanly.
hsum256_epi32() from src/kernels/gemm_kernels_q4k_q8k_vnni.cstatic inline int32_t hsum256_epi32(__m256i v) {
__m128i lo = _mm256_castsi256_si128(v);
__m128i hi = _mm256_extracti128_si256(v, 1);
__m128i sum = _mm_add_epi32(lo, hi);
sum = _mm_hadd_epi32(sum, sum);
sum = _mm_hadd_epi32(sum, sum);
return _mm_cvtsi128_si32(sum);
}dot_q4_k_q8_k_32_vnni() from src/kernels/gemm_kernels_q4k_q8k_vnni.cstatic inline int32_t dot_q4_k_q8_k_32_vnni(const uint8_t *q4_packed_32,
const int8_t *q8_32,
int high_nibble) {
const __m256i packed = _mm256_loadu_si256((const __m256i *)q4_packed_32);
const __m256i mask4 = _mm256_set1_epi8(0x0F);
const __m256i q4_bytes = high_nibble
? _mm256_and_si256(_mm256_srli_epi16(packed, 4), mask4)
: _mm256_and_si256(packed, mask4);
const __m256i q8_bytes = _mm256_loadu_si256((const __m256i *)q8_32);
__m256i acc = _mm256_setzero_si256();
acc = _mm256_dpbusd_epi32(acc, q4_bytes, q8_bytes);
return hsum256_epi32(acc);
}bsums and dminfor (int j = 0, is = 0, q_offset = 0; j < QK_K; j += 64, is += 2, q_offset += 32) {
const uint8_t *qs = &w->qs[q_offset];
const int8_t *q8_lo = &x->qs[j];
const int8_t *q8_hi = &x->qs[j + 32];
const int32_t sum_lo = dot_q4_k_q8_k_32_vnni(qs, q8_lo, 0);
const int32_t sum_hi = dot_q4_k_q8_k_32_vnni(qs, q8_hi, 1);
const int32_t bsum_lo = (int32_t)x->bsums[j / 16] +
(int32_t)x->bsums[j / 16 + 1];
const int32_t bsum_hi = (int32_t)x->bsums[(j + 32) / 16] +
(int32_t)x->bsums[(j + 32) / 16 + 1];
sumf += d * (float)sc[is] * (float)sum_lo;
sumf -= dmin * (float)m_val[is] * (float)bsum_lo;
sumf += d * (float)sc[is + 1] * (float)sum_hi;
sumf -= dmin * (float)m_val[is + 1] * (float)bsum_hi;
} Two things stand out. First, VNNI lets the inner loop become a sequence of hardware byte-dot operations instead of a home-grown multiply-and-reduce pipeline. Second, the Q8_K block-sum field exists precisely because the kernel wants to pay the offset cost once, not rediscover it expensively on every call. This is the kind of kernel a modern Intel reviewer wants to see: the runtime knows when the ISA offers a byte-dot primitive and restructures the math around it rather than pretending all vector extensions are equivalent.
VNNI does not make correctness easier by itself. CKE still keeps the fast path behind an explicit environment gate because changing accumulation order can move borderline logits. That is another sign of mature engineering: hardware enthusiasm checked by numerical discipline.
AMX — Matrix Tiles for Quantized GEMM
AMX is a different tier of hardware support entirely. Instead of simply widening SIMD registers, Intel adds tile registers and instructions that are much closer to dedicated matrix execution. In CKE, gemm_kernels_amx.c documents the model clearly: tile configuration, tile loads, tile dot products, and explicit runtime detection.
The most important practical nuance is where AMX matters. For single-token decode, quantized GEMV often dominates, and the K-quant formats still make the scalar or VNNI-friendly path the most natural implementation. For larger prefill-style matrix multiplies, however, AMX becomes a far more compelling execution surface.
__tile_config from src/kernels/gemm_kernels_amx.ctypedef struct __tile_config {
uint8_t palette_id;
uint8_t start_row;
uint8_t reserved_0[14];
uint16_t colsb[16]; /* Columns in bytes for each tile */
uint8_t rows[16]; /* Rows for each tile */
} __tile_config;configure_tiles_gemm() from src/kernels/gemm_kernels_amx.cstatic void configure_tiles_gemm(int M, int N, int K) {
__tile_config config = {0};
config.palette_id = 1;
int tile_m = (M > AMX_TILE_M) ? AMX_TILE_M : M;
int tile_k = (K > AMX_TILE_K) ? AMX_TILE_K : K;
int tile_n = (N > AMX_TILE_N) ? AMX_TILE_N : N;
config.rows[TILE_A] = tile_m;
config.colsb[TILE_A] = tile_k;
config.rows[TILE_B] = tile_k;
config.colsb[TILE_B] = tile_n * 4;
config.rows[TILE_C] = tile_m;
config.colsb[TILE_C] = tile_n * 4;
_tile_loadconfig(&config);
}gemm_amx_int8_core() inner loop from src/kernels/gemm_kernels_amx.cfor (int m = 0; m < M; m += AMX_TILE_M) {
int tile_m = (m + AMX_TILE_M <= M) ? AMX_TILE_M : (M - m);
for (int n = 0; n < N; n += AMX_TILE_N) {
int tile_n = (n + AMX_TILE_N <= N) ? AMX_TILE_N : (N - n);
_tile_zero(TILE_C);
for (int k = 0; k < K; k += AMX_TILE_K) {
int tile_k = (k + AMX_TILE_K <= K) ? AMX_TILE_K : (K - k);
_tile_loadd(TILE_A, A + m * K + k, K);
_tile_loadd(TILE_B, B + k * N + n, N * 4);
_tile_dpbssd(TILE_C, TILE_A, TILE_B);
}
_tile_stored(TILE_C, C + m * N + n, N * 4);
}
}amx_available() from src/kernels/gemm_kernels_amx.cbool amx_available(void) {
unsigned int eax, ebx, ecx, edx;
__asm__ __volatile__(
"cpuid"
: "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
: "a"(7), "c"(0)
);
bool has_amx_tile = (edx >> 24) & 1;
bool has_amx_int8 = (edx >> 25) & 1;
unsigned int xcr0_lo = 0, xcr0_hi = 0;
__asm__ __volatile__(
".byte 0x0f, 0x01, 0xd0"
: "=a"(xcr0_lo), "=d"(xcr0_hi)
: "c"(0)
);
uint64_t xcr0 = ((uint64_t)xcr0_hi << 32) | xcr0_lo;
bool os_tile_state_enabled = (xcr0 & 0x60000) == 0x60000;
return has_amx_tile && has_amx_int8 && os_tile_state_enabled;
}The AMX file is valuable even where it currently falls back for some K-quant cases. It proves the runtime is architected for matrix extensions rather than only for traditional vector ISAs. The engineering question becomes “which data layouts deserve native AMX paths next?” rather than “can the runtime even see AMX?” 8 tiles AMX exposes a different optimization surface from AVX-512: tile registers, tile loads, and dedicated dot-product instructions for larger GEMM shapes.
In short: VNNI is the elite byte-dot vector path; AMX is the matrix-engine path. A serious CPU inference runtime should understand both. CKE visibly does.
The Q8_K SSE Parity Bug — A War Story
The most revealing engineering stories are often the bugs that slip past casual testing. On 2026-03-09, CKE fixed a subtle Q8_K SSE parity bug in quantize_row_q8_k_sse.c. The root issue was not a dramatic crash. It was a contract mismatch: signed-max selection and bsums behavior no longer exactly matched the llama.cpp-style reference path.
That kind of bug is pernicious because outputs can still look mostly normal. A short text-generation sanity check might pass. But once the model repeatedly crosses a mixed-quant boundary like quantize_row_q8_k -> gemv_q4_k_q8_k, tiny mismatches can accumulate into parity drift.
In this case, the bug only surfaced during Nanbeige model bring-up with strict per-op checks. That is exactly the lesson silicon teams should take seriously: quantization bugs often survive demos and only fail under the discipline of exact or near-exact parity validation.
quantize_row_q8_k_sse.c/* Keep the exact signed-max selection contract from llama.cpp/ref. */
float max = 0.0f;
float amax = 0.0f;
for (int j = 0; j < QK_K; ++j) {
const float xv = x[j];
const float ax = fabsf(xv);
if (ax > amax) {
amax = ax;
max = xv;
}
}
if (amax == 0.0f) {
y[i].d = 0.0f;
memset(y[i].qs, 0, sizeof(y[i].qs));
memset(y[i].bsums, 0, sizeof(y[i].bsums));
x += QK_K;
continue;
}bsums contract from quantize_row_q8_k_sse.cconst float iscale = -127.0f / max;
...
for (int j = 0; j < QK_K; j += 16) {
const __m128 x0 = _mm_loadu_ps(x + j + 0);
const __m128 x1 = _mm_loadu_ps(x + j + 4);
const __m128 x2 = _mm_loadu_ps(x + j + 8);
const __m128 x3 = _mm_loadu_ps(x + j + 12);
...
const __m128i q0123 = _mm_packs_epi16(q01, q23);
_mm_storeu_si128((__m128i *)(y[i].qs + j), q0123);
int sum = 0;
for (int ii = 0; ii < 16; ++ii) {
sum += y[i].qs[j + ii];
}
y[i].bsums[j / 16] = (int16_t)sum;
}
y[i].d = 1.0f / iscale;Bug class:
wrong signed-max selection
wrong Q8_K bsums contract
Observed symptom:
outputs still looked plausible
mixed-quant parity drift accumulated slowly
Why it finally surfaced:
Nanbeige bring-up
strict parity checking
mixed-quant boundary exercised repeatedly
Reference fix:
commit 224a4d30The lesson is not merely “bugs happen.” The lesson is that quantization demands a higher standard of validation than casual end-to-end smoke tests. The output distribution can remain superficially sane while the internal contract is already broken. The parity bug is strong evidence that CKE is operating at the right engineering depth. Only teams living close to the byte contract ever discover bugs this subtle—and only disciplined parity workflows catch them reliably.

That is why CKE’s kernel rules matter: strict parity fallbacks are not there for decoration. They are how the team proves each ISA-specific implementation still preserves the intended contract.
Cache Line Geometry
Quantization is often explained with compression ratios, but cache geometry is the more revealing lens for CPU reviewers. The question is not just how many bits a weight occupies on disk. It is how many useful weights arrive per cache-line fetch once the kernel starts streaming through blocks.
Q4_0 is the easy example: 18 bytes per block means a 64-byte line holds a little more than 3.5 blocks, or about 112 weights worth of payload plus scales. Q4_K is larger and more structured: 144 bytes per super-block, which spans 2.25 cache lines. That is still attractive because one fetch bundle brings in 256 weights and their local metadata together.
This is why co-location matters. If the scales lived in one array and the packed weights lived somewhere else, the bandwidth win would erode immediately under extra cache misses and weaker prefetch behavior.
Q4_0:
18 bytes / block
64 / 18 ≈ 3.55 blocks / line
≈ 112 weights / cache line fetch
Q8_0:
34 bytes / block
64 / 34 ≈ 1.88 blocks / line
≈ 60 weights / cache line fetch
Q4_K:
144 bytes / super-block
144 / 64 = 2.25 cache lines / block
256 weights arrive with nested metadatagood layout:
[scale | packed weights | next block]
bad layout:
[all scales elsewhere]
[all packed weights elsewhere]
Reason:
the kernel wants one streaming access pattern
not two unrelated pointer chases
| Format | Bytes per block | Weights per block | Cache implication |
|---|---|---|---|
Q4_0 | 18 | 32 | Excellent density; tiny metadata overhead. |
Q8_0 | 34 | 32 | Higher fidelity, lower line density. |
Q4_K | 144 | 256 | Super-block spans multiple lines but amortizes metadata well. |
Q8_K | 292 | 256 | Larger activation block, but tailored to mixed-quant compute rather than pure storage density. |
There is also a neat mapping between block sizes and SIMD width. A 32-weight Q4_0 block lines up naturally with AVX-512 as two 16-lane FP32 chunks. A 256-weight K-quant block lines up naturally with loop nests that process 64-value quarters or 16-value sub-groups, depending on the ISA. 16 FP32 lanes AVX-512 turns a 32-weight simple block into two clean passes, which is one reason quantized decode loops often look structurally elegant on wide x86 tiers.
Bump Allocator Integration
CKE’s quantization docs make another systems choice explicit: quantized weights, FP32 activations, and scratch space live in separate bump-allocator regions. That sounds mundane until you realize how many runtime bugs and performance regressions it avoids.
Homogeneous regions improve cache behavior and alignment policy. They also prevent type confusion by construction. A pointer to block_q4_0 or block_q4_K does not share storage with live FP32 activations, and the scratch space used by temporary compute stages does not contaminate the read-only weight region.
typedef struct {
uint8_t *weights_q4; // Region 1: quantized weights
size_t weights_size;
float *activations; // Region 2: FP32 activations
size_t act_size;
float *scratch; // Region 3: temporary workspace
size_t scratch_size;
float *dequant_cache; // Optional hot dequant cache
size_t cache_size;
} BumpAllocator;size_t num_blocks = (num_weights + 31) / 32;
size_t q4_0_size = num_blocks * sizeof(block_q4_0);
void *weights = bump_alloc(allocator, q4_0_size, REGION_WEIGHTS);
size_t act_size = batch * seq_len * hidden_dim * sizeof(float);
float *activations = bump_alloc(allocator, act_size, REGION_ACTIVATIONS);Region 1:
quantized weights
read-only after model load
Region 2:
FP32 activations / hidden states
read-write during inference
Region 3:
scratch / temporary compute buffers
ephemeral per op or per stageck_dtype_block_size() from include/ckernel_dtype.hstatic inline size_t ck_dtype_block_size(CKDataType dt)
{
switch (dt) {
case CK_DT_Q4_0:
case CK_DT_Q4_1:
case CK_DT_Q5_0:
case CK_DT_Q5_1:
case CK_DT_Q8_0:
return 32;
case CK_DT_Q4_K:
case CK_DT_Q5_K:
case CK_DT_Q6_K:
case CK_DT_Q8_K:
return 256;
default:
return 1;
}
}ck_dtype_block_bytes() from include/ckernel_dtype.hstatic inline size_t ck_dtype_block_bytes(CKDataType dt)
{
switch (dt) {
case CK_DT_Q4_0:
return 18;
case CK_DT_Q4_1:
return 20;
case CK_DT_Q5_0:
return 22;
case CK_DT_Q5_1:
return 24;
case CK_DT_Q4_K:
return 144;
case CK_DT_Q6_K:
return 210;
case CK_DT_Q8_0:
return 34;
case CK_DT_Q8_K:
return 292;
default:
return 0;
}
}ck_dtype_is_quantized() from include/ckernel_dtype.hstatic inline int ck_dtype_is_quantized(CKDataType dt)
{
return dt == CK_DT_Q4_0 || dt == CK_DT_Q4_1 || dt == CK_DT_Q5_0 || dt == CK_DT_Q5_1 ||
dt == CK_DT_Q5_K || dt == CK_DT_Q4_K || dt == CK_DT_Q6_K || dt == CK_DT_Q8_0 || dt == CK_DT_Q8_K;
}This allocator design is the quiet partner of the kernel work. Quantized kernels perform best when their memory neighborhoods are predictable. Keeping regions homogeneous helps hardware prefetchers, simplifies alignment reasoning, and makes the dtype dispatch story materially safer. A runtime that cares about byte-level quantization but is sloppy about memory regions is only doing half the job. CKE’s allocator model shows the other half.
This is also where the format structs become operationally useful. Allocation is not “reserve some bytes and hope.” It is num_blocks × sizeof(block_q4_0), num_blocks × sizeof(block_q4_K), and so on. The type system and the storage plan reinforce each other.
Conclusion — 32 Files, 11,652 Lines, 8 Formats
The quantization story in CKE is not a feature flag and not a library dependency papered over with wrapper code. It is a dedicated kernel surface: simple formats, K-quants, dequantizers, activation quantizers, fused quantized normalization and attention code, and dispatch rules that explicitly target the ISA tiers modern CPU vendors care about.
The most important production path today is the mixed one: Q4_K or Q6_K weights multiplied against Q8_K activations produced on the fly from FP32 hidden state. That path explains why Q8_K needs FP32 scale precision, why bsums exist, why VNNI is such a natural match, and why the Q8_K SSE parity fix mattered so much.
The repository depth is the real message. Scalar through AVX-512, VNNI, and AMX on x86. Existing NEON paths on ARM. Register-level dequantization. Byte-level packing. Strict parity fallbacks. And a bug history subtle enough to prove the team is working at the right abstraction layer.
Where to go next
Read the official CKE quantization deep dive, the byte-level format reference, and the broader scaling thesis.
Then inspect the source directly at github.com/antshiv/C-Kernel-Engine, including the March 2026 Q8_K SSE parity fix, and pair this post with the SIMD deep dive and the ARM NEON kernel post for the full ISA-level picture.
CKE quantization story:
8 headline formats in docs
mixed-quant runtime path in production
DType-based dispatch outside the hot loop
dequantize in registers, not RAM
ISA tiers from SSE through AVX-512 VNNI and AMX
ARM NEON coverage already live
strict parity matters enough to catch subtle Q8_K bugs