ML systems · Quantization internals

This ShivasNotes deep dive is the granular companion to the broader Quantization Deep Dive in this series. The earlier post explained why quantization matters for CPU inference. This one opens the K-quant formats themselves: Q4_K, Q5_K, Q6_K, Q8_K, and the mixed K × Q8_K dot products that C-Kernel-Engine has to execute correctly.

The thesis is simple: K-quants are not just smaller floats. They are byte layouts, scale hierarchies, packed nibbles, sub-block metadata, correction terms, activation quantization, and strict parity contracts. If you cannot explain the bytes, you do not really own the kernel. This is where quantization turns from “divide by scale and round” into systems engineering. The math is small. The bookkeeping is the work.

What this post covers

Sections 1 through 6 explain why K-quants use 256-weight super-blocks and how Q4_K, Q5_K, Q6_K, and Q8_K store their values.

Sections 6 through 10 connect the formats to the actual C-Kernel-Engine runtime path: activation quantization into Q8_K, mixed quantized dot products, SIMD dispatch, parity bugs, and the mental model needed before writing AVX2/VNNI/NEON kernels. Sections 11 through 13 cover why this matters for CPU-first inference, how CKE makes per-tensor dtypes inspectable, and a summary.

Section 1: Why K-Quants Exist

Simple block quantization already works better than one scale for a whole tensor. Instead of one global scale, formats like Q4_0 use one FP16 scale for every 32 weights. That is the first important idea: preserve local dynamic range.

K-quants go one level deeper. They group 256 weights into a super-block and store nested metadata inside that super-block. The weights are still packed into low-bit integers, but the scale information becomes more structured. This lets the format reduce metadata overhead while preserving more local information than a naive single-scale block.

That is why K-quants matter for CPU inference. They reduce bytes moved from memory, but still preserve enough local detail that the model behaves like the original model. The price is that kernels become more complex: they must unpack the integers, unpack the scale hierarchy, apply correction terms, and accumulate with exactly the same semantics as the reference implementation.

The trade is not free. K-quants save memory bandwidth, but they spend instruction complexity. A CPU runtime wins only if the unpack/decode work is cheaper than moving full-precision weights through memory.

K-quant formats use 256-weight super-blocks. The kernel advances through the model in these layout units.

K-quant super-block diagram showing simple 32-weight block quantization versus 256-weight K-quant super-blocks with nested metadata.

Simple block quantization versus K-quant super-blocks (text)
Simple 32-weight format:
  Q4_0:
    d       - one FP16 scale
    qs[16]  - 32 weights, 4 bits each

K-quant 256-weight format:
  Q4_K:
    d          - super-block scale
    dmin       - super-block minimum scale
    scales[12] - packed sub-block scales and mins
    qs[128]    - 256 weights, 4 bits each

Section 2: The Format Contract

C-Kernel-Engine keeps the K-quant format contract in include/ckernel_quant.h. This matters because the model loader, memory planner, scalar reference kernel, SIMD kernel, and parity tests must all agree on the same layout.

If one layer interprets scales[12] differently from another layer, the model may still run. It will simply produce wrong logits. That is the dangerous class of quantization bug: no crash, no obvious invalid pointer, just numerically plausible nonsense.

| Format | Super-block | Metadata | Payload | Decode idea |
|--------|-------------|----------|---------|-------------|
| Q4_K | 256 weights | d, dmin, packed scales/mins | 128 bytes of 4-bit values | d × scale × q4 − dmin × min |
| Q5_K | 256 weights | similar K hierarchy plus high-bit storage | 4-bit low values plus high bit | more fidelity than Q4_K, more unpack work |
| Q6_K | 256 weights | 16 int8 scales + FP16 super-scale | ql[128] low bits + qh[64] high bits | d × scale × (q6 − 32) |
| Q8_K | 256 values | FP32 scale + block sums | 256 signed int8 values | activation-side bridge for mixed dots |

Core K-quant structs in C-Kernel-Engine (C)
#define QK_K 256
#define K_SCALE_SIZE 12

typedef struct {
    ck_half d;                    /* super-block scale */
    ck_half dmin;                 /* super-block minimum */
    uint8_t scales[K_SCALE_SIZE]; /* 8 scales + 8 mins, 6-bit packed */
    uint8_t qs[QK_K / 2];         /* 256 x 4-bit weights */
} block_q4_K;

typedef struct {
    uint8_t ql[QK_K / 2];      /* low 4 bits */
    uint8_t qh[QK_K / 4];      /* high 2 bits */
    int8_t scales[QK_K / 16];  /* 16 sub-block scales */
    ck_half d;                 /* super-block scale */
} block_q6_K;

typedef struct {
    float d;                  /* activation scale */
    int8_t qs[QK_K];          /* 256 signed int8 values */
    int16_t bsums[QK_K / 16]; /* block sums for optimization */
} block_q8_K;
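
Because every layer must agree on these layouts, one cheap safeguard is to pin the struct sizes at compile time and derive the storage cost per weight from them. The sketch below repeats the structs and assumes `ck_half` is a plain 16-bit storage type (`uint16_t` stands in for it); the `bits_per_weight` helper is a hypothetical name, not part of CKE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK_K 256
#define K_SCALE_SIZE 12

/* Assumption: ck_half is a 16-bit storage type; uint16_t stands in for it. */
typedef uint16_t ck_half;

typedef struct {
    ck_half d;
    ck_half dmin;
    uint8_t scales[K_SCALE_SIZE];
    uint8_t qs[QK_K / 2];
} block_q4_K;

typedef struct {
    uint8_t ql[QK_K / 2];
    uint8_t qh[QK_K / 4];
    int8_t scales[QK_K / 16];
    ck_half d;
} block_q6_K;

typedef struct {
    float d;
    int8_t qs[QK_K];
    int16_t bsums[QK_K / 16];
} block_q8_K;

/* Layout guards: unexpected padding fails the build instead of the logits. */
static_assert(sizeof(block_q4_K) == 2 + 2 + 12 + 128, "Q4_K must be 144 bytes");
static_assert(sizeof(block_q6_K) == 128 + 64 + 16 + 2, "Q6_K must be 210 bytes");
static_assert(sizeof(block_q8_K) == 4 + 256 + 32, "Q8_K must be 292 bytes");

/* Hypothetical helper: storage cost of one 256-weight super-block, in bits per weight. */
static inline double bits_per_weight(size_t block_bytes) {
    return 8.0 * (double)block_bytes / (double)QK_K;
}
```

With these layouts Q4_K costs 4.5 bits per weight, Q6_K costs 6.5625, and Q8_K costs 9.125. Q4_0 (18 bytes per 32 weights) is also 4.5 bits per weight, so Q4_K carries a scale and a min for every 32 weights at the same storage cost.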

Section 3: Q4_K — The Format That Looks Simple Until You Decode It

Q4_K stores 256 weights in 144 bytes. The obvious part is qs[128]: each byte holds two 4-bit values. The non-obvious part is scales[12], which packs eight scale values and eight minimum values into 6-bit fields.

The decode is not simply weight = q × scale. It has a scale term and a minimum-correction term. In the CKE scalar reference kernel, the dot product computes integer products weighted by the unpacked sub-block scales, then subtracts the dmin correction using Q8_K block sums.

Q4_K layout with d, dmin, packed 12-byte scales field, and 128-byte packed 4-bit weight payload.

Q4_K scale/min unpacking contract (C)
static inline void unpack_q4_k_scales(const uint8_t *scales,
                                      uint8_t *sc, uint8_t *m) {
    sc[0] = scales[0] & 0x3F;
    sc[1] = scales[1] & 0x3F;
    sc[2] = scales[2] & 0x3F;
    sc[3] = scales[3] & 0x3F;

    m[0] = scales[4] & 0x3F;
    m[1] = scales[5] & 0x3F;
    m[2] = scales[6] & 0x3F;
    m[3] = scales[7] & 0x3F;

    sc[4] = (scales[8]  & 0x0F) | ((scales[0] >> 6) << 4);
    sc[5] = (scales[9]  & 0x0F) | ((scales[1] >> 6) << 4);
    sc[6] = (scales[10] & 0x0F) | ((scales[2] >> 6) << 4);
    sc[7] = (scales[11] & 0x0F) | ((scales[3] >> 6) << 4);

    m[4] = (scales[8]  >> 4) | ((scales[4] >> 6) << 4);
    m[5] = (scales[9]  >> 4) | ((scales[5] >> 6) << 4);
    m[6] = (scales[10] >> 4) | ((scales[6] >> 6) << 4);
    m[7] = (scales[11] >> 4) | ((scales[7] >> 6) << 4);
}

The first real lesson of Q4_K is that the scale array is not an array of bytes. It is a 12-byte bit field. Treating it like ordinary metadata is how subtle parity bugs enter the runtime.

The entire Q4_K scale/min hierarchy for 256 weights is packed into only twelve bytes (12 B).
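
The arithmetic behind the twelve bytes is 16 fields × 6 bits = 96 bits. One way to gain confidence in the unpack contract is a round trip: pack known 6-bit values, unpack them, and compare. The sketch below restates the unpack function in loop form and adds a hypothetical `pack_q4_k_scales` inverse that exists only as a test fixture; it is derived from the unpack layout above, not taken from the CKE sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define K_SCALE_SIZE 12

/* Loop form of the Q4_K scale/min unpack contract. */
static inline void unpack_q4_k_scales(const uint8_t *scales,
                                      uint8_t *sc, uint8_t *m) {
    for (int j = 0; j < 4; ++j) {
        sc[j]     = scales[j] & 0x3F;
        m[j]      = scales[j + 4] & 0x3F;
        sc[j + 4] = (uint8_t)((scales[j + 8] & 0x0F) | ((scales[j] >> 6) << 4));
        m[j + 4]  = (uint8_t)((scales[j + 8] >> 4) | ((scales[j + 4] >> 6) << 4));
    }
}

/* Hypothetical inverse of the unpack above, used only as a test fixture. */
static inline void pack_q4_k_scales(const uint8_t *sc, const uint8_t *m,
                                    uint8_t *scales) {
    for (int j = 0; j < 4; ++j) {
        assert(sc[j] < 64 && sc[j + 4] < 64 && m[j] < 64 && m[j + 4] < 64);
        /* bytes 0..3: scale j in the low 6 bits, top 2 bits of scale j+4 above */
        scales[j]     = (uint8_t)(sc[j] | ((sc[j + 4] >> 4) << 6));
        /* bytes 4..7: min j in the low 6 bits, top 2 bits of min j+4 above */
        scales[j + 4] = (uint8_t)(m[j] | ((m[j + 4] >> 4) << 6));
        /* bytes 8..11: low nibbles of scale j+4 and min j+4 */
        scales[j + 8] = (uint8_t)((sc[j + 4] & 0x0F) | ((m[j + 4] & 0x0F) << 4));
    }
}

/* Returns 1 when pack followed by unpack reproduces all sixteen 6-bit fields. */
static inline int q4_k_scales_roundtrip_ok(void) {
    uint8_t sc[8], m[8], packed[K_SCALE_SIZE], sc2[8], m2[8];
    for (int j = 0; j < 8; ++j) {
        sc[j] = (uint8_t)((j * 9 + 5) & 0x3F); /* exercises the high 2 bits */
        m[j]  = (uint8_t)(63 - j * 7);
    }
    pack_q4_k_scales(sc, m, packed);
    unpack_q4_k_scales(packed, sc2, m2);
    return memcmp(sc, sc2, sizeof sc) == 0 && memcmp(m, m2, sizeof m) == 0;
}
```

A round trip like this only proves that pack and unpack agree with each other. Parity with a known-good model still has to be checked against real tensor data.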

Section 4: Q5_K — The Middle Tier

Q5_K sits between Q4_K and Q6_K. The idea is straightforward: spend one more bit per weight than Q4-style storage so the quantized value can represent more levels. But that one extra bit does not arrive for free. It usually means high-bit packing, extra unpack logic, and additional places for SIMD paths to disagree with scalar reference code.

In practice, Q5_K is useful as a mental bridge. If Q4_K teaches “nibbles plus nested scale/min metadata,” and Q6_K teaches “low 4 bits plus high 2 bits and signed sub-scales,” then Q5_K is the intermediate form where the kernel writer starts to see why bit planes matter.

| Format | What improves | What becomes harder |
|--------|---------------|---------------------|
| Q4_K | Best compression among the K formats discussed here. | Scale/min correction and nibble unpacking. |
| Q5_K | More quantization levels than Q4_K. | High-bit handling and parity with mixed paths. |
| Q6_K | Better fidelity; cleaner signed reconstruction. | Low/high bit-plane extraction and more bytes moved. |

How to think about the Q5_K tier (text)
Q4_K:
  4-bit payload
  strong compression
  scale/min correction is central

Q5_K:
  5-bit payload
  better value resolution
  extra high-bit bookkeeping

Q6_K:
  6-bit payload
  higher fidelity
  low 4 bits + high 2 bits + signed recentering
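
To make the "extra high-bit bookkeeping" concrete, here is a minimal sketch of rebuilding one 5-bit value from a 4-bit nibble and a single bit taken from a separate high-bit plane. The helper names are hypothetical, and which byte and bit belong to which weight is left to the caller because this post does not spell out the Q5_K layout. The decode form assumes Q5_K follows the same scale/min shape given for Q4_K.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rebuild one unsigned 5-bit value from its two storage planes.
 *   nibble    : the 4 low bits, already extracted from the packed payload
 *   high_bits : one byte of the high-bit plane
 *   bit       : which bit of that byte belongs to this weight (0..7)
 * Assumption: the byte/bit-to-weight mapping is format-specific and is
 * chosen by the caller; this helper only does the bit splice.
 */
static inline uint8_t q5_from_planes(uint8_t nibble, uint8_t high_bits, int bit) {
    assert(nibble < 16 && bit >= 0 && bit < 8);
    return (uint8_t)(nibble | (((high_bits >> bit) & 1u) << 4));
}

/*
 * Decode with the scale/min form d * scale * q - dmin * min.
 * d_scale is the product d * scale, dmin_min is the product dmin * min.
 */
static inline float q5_decode(float d_scale, float dmin_min, uint8_t q5) {
    return d_scale * (float)q5 - dmin_min;
}
```

The splice is a single OR, but every SIMD path has to agree with the scalar path on which bit of which byte it reads, which is where the parity risk sits.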

Section 5: Q6_K — Low Bits, High Bits, and the −32 Centering Step

Q6_K uses a different shape from Q4_K. It stores low 4 bits in ql[128], high 2 bits in qh[64], sixteen signed sub-block scales, and one FP16 super-block scale.

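The scalar reference code reconstructs a 6-bit unsigned value in the range 0 to 63, then subtracts 32 to recenter it into the signed range −32 to 31. That detail is not cosmetic. If the kernel forgets the −32, every reconstructed weight is shifted up by 32 quantization steps, and the dot product picks up a large bias proportional to the sub-block scale times the sum of the activations.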

Q6_K layout showing ql low 4-bit plane, qh high 2-bit plane, sixteen int8 scales, and FP16 super-block scale.

Q6_K scalar reconstruction pattern (C)
const int8_t q1 =
    (int8_t)((ql[l + 0] & 0xF) |
    (((qh[l] >> 0) & 3) << 4)) - 32;

const int8_t q2 =
    (int8_t)((ql[l + 32] & 0xF) |
    (((qh[l] >> 2) & 3) << 4)) - 32;

const int8_t q3 =
    (int8_t)((ql[l + 0] >> 4) |
    (((qh[l] >> 4) & 3) << 4)) - 32;

const int8_t q4 =
    (int8_t)((ql[l + 32] >> 4) |
    (((qh[l] >> 6) & 3) << 4)) - 32;

That extraction pattern is the kind of code that looks ugly because the data layout is optimized for compactness, not for human readability. The correct implementation is the one that agrees with the reference format.
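
The four-way pattern can be wrapped in a loop that rebuilds 128 values from 64 low-plane bytes and 32 high-plane bytes. This is a sketch, not the CKE kernel: `q6k_unpack_128` is a hypothetical name, and the output offsets (`l`, `l + 32`, `l + 64`, `l + 96`) are assumed to mirror the activation offsets used by the Q6_K × Q8_K accumulation. The `center` parameter exists only to make the forgotten-offset bug visible.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reconstruct 128 signed 6-bit values from 64 low-plane bytes (ql) and
 * 32 high-plane bytes (qh), following the q1..q4 extraction pattern.
 * center should be 32 for a correct decode; passing 0 reproduces the
 * bug where the recentering step is forgotten.
 */
static inline void q6k_unpack_128(const uint8_t *ql, const uint8_t *qh,
                                  int center, int8_t *out) {
    assert(ql && qh && out);
    for (int l = 0; l < 32; ++l) {
        const int q1 = (ql[l] & 0xF)       | (((qh[l] >> 0) & 3) << 4);
        const int q2 = (ql[l + 32] & 0xF)  | (((qh[l] >> 2) & 3) << 4);
        const int q3 = (ql[l] >> 4)        | (((qh[l] >> 4) & 3) << 4);
        const int q4 = (ql[l + 32] >> 4)   | (((qh[l] >> 6) & 3) << 4);
        out[l + 0]  = (int8_t)(q1 - center);
        out[l + 32] = (int8_t)(q2 - center);
        out[l + 64] = (int8_t)(q3 - center);
        out[l + 96] = (int8_t)(q4 - center);
    }
}
```

Calling it twice on the same bytes, once with `center = 32` and once with `center = 0`, gives outputs that differ by exactly 32 at every position, which is the bias the dot product would then multiply by each scale and activation.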

Section 6: Q8_K — Why Activations Become Quantized Too

For decode-style inference, the model weights can stay quantized on disk and in memory. But the activation vector usually arrives as FP32 or BF16. To do a mixed quantized dot product efficiently, CKE quantizes that activation row into Q8_K blocks on the fly.

This is why Q8_K appears in kernels like gemv_q4_k_q8_k and gemv_q6_k_q8_k. The left side is a compressed weight row. The right side is a temporary quantized activation representation. The dot product can then accumulate integer products and apply floating scale terms at the block boundary.

Mixed quantized dot path where Q4_K or Q6_K weights multiply Q8_K activations.

Q8_K activation quantization path (C)
void quantize_row_q8_k(const float *x, void *vy, int k) {
#if defined(__AVX512F__) && defined(__AVX512BW__)
    quantize_row_q8_k_avx512(x, vy, k);
#elif defined(__AVX2__)
    quantize_row_q8_k_avx2(x, vy, k);
#elif defined(__AVX__)
    quantize_row_q8_k_avx(x, vy, k);
#elif defined(__SSE4_1__)
    quantize_row_q8_k_sse(x, vy, k);
#else
    quantize_row_q8_k_ref(x, vy, k);
#endif
}

The activation quantizer is part of the inference hot path. If it is slow, the K-quant GEMV cannot win. If it is numerically different from the reference path, the optimized kernel inherits the error. A mixed quant kernel is only as trustworthy as both halves: the weight format decoder and the activation quantizer.
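
To show what the scalar fallback has to produce, here is a simplified symmetric quantizer for one or more Q8_K blocks: find the absolute maximum, scale to the int8 range, and record a sum for every 16 values. It is a sketch under stated assumptions. The name `quantize_row_q8_k_sketch` is hypothetical, and the rounding rule and scale convention are illustrative, so it should not be read as bit-identical to `quantize_row_q8_k_ref`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK_K 256

typedef struct {
    float d;                  /* activation scale */
    int8_t qs[QK_K];          /* 256 signed int8 values */
    int16_t bsums[QK_K / 16]; /* sum of each group of 16 quantized values */
} block_q8_K;

/* Round half away from zero without depending on libm. */
static inline int round_to_int(float v) {
    return (int)(v + (v >= 0.0f ? 0.5f : -0.5f));
}

/*
 * Simplified symmetric quantizer: d = amax / 127, q = round(x / d).
 * Assumption: rounding mode and scale sign are illustrative and may
 * differ from the CKE reference quantizer.
 */
static inline void quantize_row_q8_k_sketch(const float *x, block_q8_K *y, int k) {
    assert(k % QK_K == 0);
    for (int i = 0; i < k / QK_K; ++i, x += QK_K) {
        float amax = 0.0f;
        for (int j = 0; j < QK_K; ++j) {
            const float ax = x[j] < 0.0f ? -x[j] : x[j];
            if (ax > amax) amax = ax;
        }
        if (amax == 0.0f) { /* all-zero block: avoid dividing by zero */
            memset(&y[i], 0, sizeof y[i]);
            continue;
        }
        const float inv = 127.0f / amax;
        for (int j = 0; j < QK_K; ++j) {
            int q = round_to_int(x[j] * inv);
            if (q > 127) q = 127;
            if (q < -127) q = -127;
            y[i].qs[j] = (int8_t)q;
        }
        for (int j = 0; j < QK_K / 16; ++j) {
            int sum = 0;
            for (int t = 0; t < 16; ++t) sum += y[i].qs[j * 16 + t];
            y[i].bsums[j] = (int16_t)sum; /* |sum| <= 16 * 127, fits int16 */
        }
        y[i].d = amax / 127.0f;
    }
}
```

The block sums are computed from the quantized values, not from the original floats, so that they stay consistent with what the integer dot product actually multiplies.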

Section 7: The Q4_K × Q8_K Dot Product

The Q4_K × Q8_K reference dot product shows the whole algorithm in compact form. First unpack the Q4_K scale/min fields. Then compute the product of packed Q4 values and Q8 activation values. Then apply scale terms. Then subtract the minimum correction using bsums.

The correction term is the part worth slowing down for. Because Q4_K has a min term, the dot product has to account for the contribution of that minimum across the activation block. Q8_K stores block sums partly to make that correction cheap.
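
Written out for one super-block, the correction falls directly out of the decode formula. The symbols here are my own shorthand: $s_b$ and $m_b$ are the unpacked 6-bit scale and min of 32-weight sub-block $b$, $q_k$ is the unsigned 4-bit weight, $a_k$ is the int8 activation, and $d_x$ is the Q8_K activation scale.

```latex
\begin{aligned}
w_k &= d\, s_b\, q_k - d_{\min}\, m_b,
  \qquad x_k = d_x\, a_k,
  \qquad k \in \text{sub-block } b \\[4pt]
\sum_{k} w_k x_k
  &= d\, d_x \sum_{b=0}^{7} s_b \sum_{k \in b} q_k a_k
   \;-\; d_{\min}\, d_x \sum_{b=0}^{7} m_b \sum_{k \in b} a_k \\[4pt]
\sum_{k \in b} a_k &= \mathrm{bsums}[2b] + \mathrm{bsums}[2b+1]
\end{aligned}
```

The second sum does not depend on the weights at all. Each 32-weight sub-block spans two 16-value block sums, which is why the reference code indexes the mins as `m_val[j / 2]` while walking `bsums[j]`.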

Q4_K × Q8_K scalar reference shape (C)
uint8_t sc[8], m_val[8];
unpack_q4_k_scales(w[i].scales, sc, m_val);

const float d = CK_FP16_TO_FP32(w[i].d) * x[i].d;
const float dmin = CK_FP16_TO_FP32(w[i].dmin) * x[i].d;

int32_t aux32[8] = {0};
int sumi = 0;
for (int j = 0; j < QK_K / 16; ++j) {
    sumi += (int)x[i].bsums[j] * (int)m_val[j / 2];
}

/* q4 nibble products, weighted by sc[], accumulate into aux32[0..7];
   aux32_sum below stands for the sum of those eight lanes */
/* then apply scale and min correction */
sumf += d * (float)aux32_sum;
sumf -= dmin * (float)sumi;

This is the real difference between “I understand 4-bit quantization” and “I understand Q4_K.” The first says “unpack nibbles and multiply by scale.” The second says “unpack nibbles, unpack scale/min metadata, use activation block sums, and subtract the correction term with exactly the right indexing.”
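
The reference shape leaves the nibble loop as a comment, so here is a self-contained sketch of one super-block that fills it in and checks itself against a slow dequantize-then-multiply path. Several things are assumptions: the FP16 fields are widened to `float` so no half-precision helper is needed, the struct and function names are hypothetical, and the nibble order (low nibbles are the first 32 weights of each 64-weight chunk, high nibbles the next 32) is assumed rather than taken from this post.

```c
#include <assert.h>
#include <stdint.h>

#define QK_K 256

/* Assumption: d and dmin are widened from FP16 to float for this sketch. */
typedef struct { float d, dmin; uint8_t scales[12]; uint8_t qs[QK_K / 2]; } q4k_block_f32;
typedef struct { float d; int8_t qs[QK_K]; int16_t bsums[QK_K / 16]; } q8k_block;

static inline void unpack_scales(const uint8_t *s, uint8_t *sc, uint8_t *m) {
    for (int j = 0; j < 4; ++j) {
        sc[j]     = s[j] & 0x3F;
        m[j]      = s[j + 4] & 0x3F;
        sc[j + 4] = (uint8_t)((s[j + 8] & 0x0F) | ((s[j] >> 6) << 4));
        m[j + 4]  = (uint8_t)((s[j + 8] >> 4) | ((s[j + 4] >> 6) << 4));
    }
}

/* Fused path: integer products, then scale and min correction at the end. */
static inline float dot_q4k_q8k_fused(const q4k_block_f32 *w, const q8k_block *x) {
    assert(w && x);
    uint8_t sc[8], m[8];
    unpack_scales(w->scales, sc, m);

    int32_t sum_q = 0;
    for (int c = 0; c < QK_K / 64; ++c) {      /* 64 weights per chunk */
        const uint8_t *q4 = w->qs + 32 * c;
        const int8_t *q8 = x->qs + 64 * c;
        int32_t lo = 0, hi = 0;
        for (int l = 0; l < 32; ++l) {
            lo += (q4[l] & 0x0F) * q8[l];      /* assumed: low nibble = weights 0..31  */
            hi += (q4[l] >> 4) * q8[l + 32];   /* assumed: high nibble = weights 32..63 */
        }
        sum_q += sc[2 * c] * lo + sc[2 * c + 1] * hi;
    }

    int32_t sum_m = 0;                          /* min correction via block sums */
    for (int j = 0; j < QK_K / 16; ++j) sum_m += x->bsums[j] * m[j / 2];

    return w->d * x->d * (float)sum_q - w->dmin * x->d * (float)sum_m;
}

/* Slow oracle: dequantize every weight, then multiply by the activation. */
static inline float dot_q4k_q8k_dequant(const q4k_block_f32 *w, const q8k_block *x) {
    uint8_t sc[8], m[8];
    unpack_scales(w->scales, sc, m);
    float sum = 0.0f;
    for (int k = 0; k < QK_K; ++k) {
        const int c = k / 64, r = k % 64, b = k / 32;
        const uint8_t byte = w->qs[32 * c + (r % 32)];
        const int q = (r < 32) ? (byte & 0x0F) : (byte >> 4);
        const float wk = w->d * (float)sc[b] * (float)q - w->dmin * (float)m[b];
        sum += wk * (x->d * (float)x->qs[k]);
    }
    return sum;
}
```

The two functions share the same layout assumptions, so agreement between them validates the algebra (scale weighting, `j / 2` indexing, sign of the correction) but not the nibble order itself. That still needs a comparison against tensors from a known-good model.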

Section 8: The Q6_K × Q8_K Dot Product

Q6_K × Q8_K has a different hot loop. There is no dmin correction term like Q4_K. Instead, the code reconstructs signed 6-bit values, multiplies each by its signed sub-block scale, multiplies by the Q8 activation, and accumulates.

That makes the Q6_K mental model cleaner in one way: weight ≈ d × scale[sub] × (q6 − 32). But the bit extraction is more complex because each group of values is split across low and high bit planes.

Q6_K × Q8_K reference accumulation (C)
aux32[l & 7] += (int)sc[is + 0] * (int)q1 * (int)q8[l + 0];
aux32[l & 7] += (int)sc[is + 2] * (int)q2 * (int)q8[l + 32];
aux32[l & 7] += (int)sc[is + 4] * (int)q3 * (int)q8[l + 64];
aux32[l & 7] += (int)sc[is + 6] * (int)q4 * (int)q8[l + 96];

for (int l = 0; l < 8; ++l) {
    sums[l] += d * (float)aux32[l];
}

When this is vectorized, the structure becomes: unpack bit planes, recenter values, multiply by scales, multiply by Q8 activations, accumulate in int32 lanes, then apply the floating block scale. The SIMD implementation can be completely different mechanically, but it must be identical semantically.
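
Before vectorizing, it can help to have the whole scalar structure in one place. The sketch below computes one Q6_K × Q8_K super-block with a single int32 accumulator instead of eight lanes. It assumes the super-block is processed as two 128-weight halves, each advancing `ql` by 64 bytes, `qh` by 32 bytes, and the scales by 8, with `is = l / 16`; those strides, the float-widened `d`, and the names are assumptions, not quotes from the CKE source.

```c
#include <assert.h>
#include <stdint.h>

#define QK_K 256

/* Assumption: the FP16 super-block scale is widened to float for this sketch. */
typedef struct {
    uint8_t ql[QK_K / 2];
    uint8_t qh[QK_K / 4];
    int8_t scales[QK_K / 16];
    float d;
} q6k_block_f32;

typedef struct {
    float d;
    int8_t qs[QK_K];
    int16_t bsums[QK_K / 16];
} q8k_block;

static inline float dot_q6k_q8k_block(const q6k_block_f32 *w, const q8k_block *x) {
    assert(w && x);
    int32_t acc = 0;
    for (int h = 0; h < QK_K / 128; ++h) {        /* two 128-weight halves (assumed) */
        const uint8_t *ql = w->ql + 64 * h;
        const uint8_t *qh = w->qh + 32 * h;
        const int8_t *sc = w->scales + 8 * h;
        const int8_t *q8 = x->qs + 128 * h;
        for (int l = 0; l < 32; ++l) {
            const int is = l / 16;                /* which 16-weight scale group */
            /* unpack bit planes and recenter to the signed range */
            const int q1 = ((ql[l] & 0xF)      | (((qh[l] >> 0) & 3) << 4)) - 32;
            const int q2 = ((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
            const int q3 = ((ql[l] >> 4)       | (((qh[l] >> 4) & 3) << 4)) - 32;
            const int q4 = ((ql[l + 32] >> 4)  | (((qh[l] >> 6) & 3) << 4)) - 32;
            /* scale, multiply by the activation, accumulate in int32 */
            acc += sc[is + 0] * q1 * q8[l + 0];
            acc += sc[is + 2] * q2 * q8[l + 32];
            acc += sc[is + 4] * q3 * q8[l + 64];
            acc += sc[is + 6] * q4 * q8[l + 96];
        }
    }
    /* floating scales are applied once, at the block boundary */
    return w->d * x->d * (float)acc;
}
```

A useful probe for scale indexing is to set `scales[i] = i`, zero every activation except one, and check that the result picks up the scale of the 16-weight group that activation belongs to.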

Section 9: Dispatch Is Part of Correctness

C-Kernel-Engine keeps scalar reference implementations because they are the oracle. Optimized paths exist for AVX, AVX2, AVX-512, VNNI, SSE, and NEON depending on the kernel. But dispatch is not only about speed. It is also about debug control.

For example, CK_DEBUG_Q8K_REF can force the reference Q8_K activation quantizer. CK_DEBUG_Q6K_Q8K_REF can force the Q6_K/Q8_K reference path. Those switches matter because when a model diverges, you need to isolate whether the error is in the format decode, activation quantization, dot product, or SIMD reduction.

Debug switches make the optimized path falsifiable (text)
CK_DEBUG_Q8K_REF=1
  force scalar Q8_K activation quantization

CK_DEBUG_Q6K_Q8K_REF=1
  force scalar Q6_K × Q8_K dot product

Purpose:
  compare scalar vs SIMD
  isolate quantizer bugs from dot-product bugs
  protect model bring-up with parity gates

Common K-quant bug surfaces: nibble order, scale unpacking, dmin correction, recentering, block stride, and SIMD parity.
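
One possible shape for such a switch is a function-pointer selection that consults the flag before falling back to the fast path. This is a sketch only. It assumes the switch is an environment variable read with `getenv`, the quantizer functions are empty stand-ins, and the names `select_q8k_quantizer` and `debug_flag_enabled` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef void (*q8k_quantize_fn)(const float *x, void *vy, int k);

/* Hypothetical stand-ins for the scalar reference and the SIMD quantizer. */
static void quantize_ref_stub(const float *x, void *vy, int k)  { (void)x; (void)vy; (void)k; }
static void quantize_simd_stub(const float *x, void *vy, int k) { (void)x; (void)vy; (void)k; }

/* A flag counts as set when it is present, non-empty, and not "0". */
static inline int debug_flag_enabled(const char *value) {
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Taking the flag value as a parameter keeps the selection logic testable. */
static inline q8k_quantize_fn select_q8k_quantizer(const char *flag_value) {
    return debug_flag_enabled(flag_value) ? quantize_ref_stub : quantize_simd_stub;
}

/* Assumption: the switch is read from the process environment. */
static inline q8k_quantize_fn select_q8k_quantizer_from_env(void) {
    q8k_quantize_fn fn = select_q8k_quantizer(getenv("CK_DEBUG_Q8K_REF"));
    assert(fn != NULL);
    return fn;
}
```

Selecting once and caching the pointer keeps the `getenv` lookup out of the hot path, and a run can then be repeated with only the quantizer swapped to see whether the divergence follows it.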

Section 10: The Kernel Writer’s Checklist

Before writing an optimized K-quant kernel, the checklist is mechanical. This is the exact reason these posts help harden C-Kernel-Engine: if the explanation cannot survive the checklist, the kernel probably cannot either.

| Question | Why it matters |
|----------|----------------|
| What is the block size? | All K-quant loops advance by 256-weight super-blocks. |
| Where are the low bits? | Nibble order changes the value reconstruction. |
| Where are the high bits? | Q5_K and Q6_K require separate high-bit handling. |
| How are scales packed? | Q4_K uses 6-bit packed scale/min metadata. |
| Is there a min correction? | Q4_K needs dmin × bsums correction. |
| Is the value centered? | Q6_K requires q6 − 32. |
| What is the activation representation? | Mixed dot products depend on Q8_K quantization. |
| What is the scalar oracle? | Every SIMD path must match the scalar reference before benchmarking. |
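
The last row of the checklist can be turned into a small gate that runs before any benchmark. The sketch below compares an optimized output vector with the scalar one and reports the worst element. The names `compare_outputs` and `parity_gate` are hypothetical, and the tolerance is left to the caller; passing `0.0f` demands exact agreement.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float max_abs_diff;  /* largest |ref - opt| seen, or NaN if one appeared */
    size_t worst_index;  /* where that difference occurred */
} parity_report;

static inline parity_report compare_outputs(const float *ref, const float *opt, size_t n) {
    parity_report r = {0.0f, 0};
    for (size_t i = 0; i < n; ++i) {
        float diff = ref[i] - opt[i];
        if (diff != diff) {          /* NaN never compares equal to itself */
            r.max_abs_diff = diff;
            r.worst_index = i;
            return r;                /* report the first NaN immediately */
        }
        if (diff < 0.0f) diff = -diff;
        if (diff > r.max_abs_diff) {
            r.max_abs_diff = diff;
            r.worst_index = i;
        }
    }
    return r;
}

/* Returns 1 when every element is within tol; a NaN anywhere fails the gate. */
static inline int parity_gate(const float *ref, const float *opt, size_t n, float tol) {
    assert(tol >= 0.0f);
    return compare_outputs(ref, opt, n).max_abs_diff <= tol;
}
```

Reporting the worst index matters as much as the pass/fail bit: an error that always lands at the same offset inside a 256-weight block usually points at stride or nibble order, not at floating-point reduction order.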

Format trade-off curve showing approximate bytes per weight versus fidelity intuition for Q4_0, Q4_K, Q5_K, Q6_K, and Q8_K.

Section 11: Why This Matters for CPU-First AI

On CPU, quantization is not a side feature. It is the deployment boundary. If the weights are too large, the model does not fit cleanly. If the weights fit but the format is slow to decode, the runtime loses throughput. If the decode is fast but not bit-for-bit compatible with the reference, the model becomes untrustworthy.

K-quants sit exactly at that boundary. They are compact enough to make CPU inference practical, but structured enough to preserve useful model quality. That is why CKE needs to own them at the kernel level rather than treating them as an opaque format imported from somewhere else.

My practical rule: the format is not supported until I can explain the bytes, write the scalar path, write the optimized path, and compare both against a known-good model path. “It loads” is not enough. “It matches the scalar oracle and the model parity gate” is where support begins.

Section 12: How CKE Makes Quantization Inspectable

This is where the C-Kernel-Engine IR Visualizer becomes more than a dashboard. A model filename might say Q4_K_M, but the runtime still has to inspect the actual tensor dtypes inside the artifact. A real GGUF can mix q4_k, q5_0, q6_k, q8_0, fp16, and fp32 tensors across attention, MLP, embeddings, output heads, and normalization weights.

The practical rule is simple: CKE should not trust the model-level label as the kernel contract. It should read the weights, lower the graph, build the memory plan, and make the dtype choice visible before codegen and runtime dispatch. That is why the generated ir_report.html includes a Weight Dtype Audit surface.

C-Kernel-Engine IR visualizer dtype audit excerpt showing mixed q5_0, q8_0, q6_k, q4_k, and fp32 tensors across transformer layers.

In the Qwen2 audit I am using while hardening CKE, the same model artifact shows mixed dtype behavior across layers. Some rows use q5_0 attention projections, some keep selected tensors at q8_0, MLP down projections can appear as q6_k or q4_k, and layer norms stay in fp32. That is the entire point: quantization is not a single string. It is a per-tensor execution contract.

| Visualizer surface | What it proves | Why it matters for K-quants |
|--------------------|----------------|-----------------------------|
| Weight Dtype Audit | Shows the dtype for each major tensor by layer. | Prevents treating a mixed model as if every weight were the same format. |
| Full Chain | Connects loaded tensor metadata through IR lowering and codegen. | Lets the runtime verify that q4_k, q6_k, or q8_0 tensors reach the right kernel family. |
| Per-Layer Flow Graph | Shows how each layer consumes weights and activations. | Makes it easier to debug whether a parity failure belongs to attention, MLP, normalization, or output projection. |
| Run Hub | Indexes generated runs and links to each ir_report.html. | Turns many model experiments into a browsable audit ledger instead of scattered files. |
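
The per-tensor contract can be pictured as a lookup keyed on each tensor's own dtype, never on the model-level label. The sketch below is illustrative only: the enum, the `tensor_info` struct, and both function names are hypothetical and do not mirror CKE's real types. Only the two kernel names come from this post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical dtype tags; the real CKE enum may differ. */
typedef enum { DT_Q4_K, DT_Q5_0, DT_Q6_K, DT_Q8_0, DT_FP16, DT_FP32, DT_COUNT } tensor_dtype;

typedef struct {
    const char *name;    /* tensor name as read from the artifact */
    tensor_dtype dtype;  /* dtype of this tensor, not of the model */
} tensor_info;

/* Map a tensor dtype to the mixed GEMV kernel named in this post, if any. */
static inline const char *mixed_gemv_kernel(tensor_dtype dt) {
    switch (dt) {
    case DT_Q4_K: return "gemv_q4_k_q8_k";
    case DT_Q6_K: return "gemv_q6_k_q8_k";
    default:      return NULL; /* other dtypes are outside the scope of this sketch */
    }
}

/* Tally how many tensors use each dtype, the raw material of a dtype audit. */
static inline void dtype_histogram(const tensor_info *t, size_t n,
                                   size_t counts[DT_COUNT]) {
    for (size_t d = 0; d < DT_COUNT; ++d) counts[d] = 0;
    for (size_t i = 0; i < n; ++i) {
        assert((int)t[i].dtype >= 0 && (int)t[i].dtype < DT_COUNT);
        counts[t[i].dtype]++;
    }
}
```

A histogram with more than one non-zero bucket is the normal case for a mixed artifact, and it is the signal that a single model-wide kernel choice would be wrong.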

Generate a v7 IR Visualizer report from an existing run (bash)
RUN=$HOME/.cache/ck-engine-v7/models/train/<run-name>

python3 version/v7/tools/open_ir_visualizer.py \
  --generate \
  --run "$RUN" \
  --html-only \
  --strict-run-artifacts \
  --output "$RUN/ir_report.html"

python3 version/v7/tools/open_ir_hub.py --open

Generate a v8 model run with the visualizer enabled (bash)
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen2-0.5B-Instruct-GGUF/qwen2-0_5b-instruct-q4_k_m.gguf \
  --context-len 1024 \
  --force-compile \
  --force-convert \
  --generate-visualizer

python3 version/v8/tools/open_ir_hub_v8.py --open

That inspection loop is important for CPU-first AI because the hard bugs are rarely visible at the marketing label level. A Q4_K_M model may still preserve sensitive tensors at higher precision because model quality can collapse quickly if the wrong operation is quantized too aggressively. CKE therefore needs both: the low-level kernel knowledge from this post and the IR-level audit surface that proves which dtype each tensor actually uses.

Section 13: Summary

Q4_K teaches nested scale/min metadata and correction terms. Q5_K teaches the intermediate high-bit bookkeeping tier. Q6_K teaches low/high bit-plane reconstruction and signed recentering. Q8_K teaches why activation quantization is part of the mixed-dot hot path.

The lesson is not that one format is always best. The lesson is that each format is a contract. The byte layout, dequant math, activation path, SIMD implementation, IR dtype audit, and model-level validation must all agree. That is the work underneath CPU-first inference.

One-line mental model (text)
K-quant support =
  byte layout
  + scale hierarchy
  + packed integer reconstruction
  + mixed Q8_K activation path
  + scalar oracle
  + SIMD parity
  + model-level validation