ML systems · Quantization internals
This ShivasNotes deep dive is the granular companion to the broader Quantization Deep Dive in this series. The earlier post explained why quantization matters for CPU inference. This one opens the K-quant formats themselves: Q4_K, Q5_K, Q6_K, Q8_K, and the mixed K × Q8_K dot products that C-Kernel-Engine has to execute correctly.
The thesis is simple: K-quants are not just smaller floats. They are byte layouts, scale hierarchies, packed nibbles, sub-block metadata, correction terms, activation quantization, and strict parity contracts. If you cannot explain the bytes, you do not really own the kernel. This is where quantization turns from “divide by scale and round” into systems engineering. The math is small. The bookkeeping is the work.
What this post covers
Sections 1 through 5 explain why K-quants use 256-weight super-blocks and how Q4_K, Q5_K, Q6_K, and Q8_K store their values.
Sections 6 through 13 connect the formats to the actual C-Kernel-Engine runtime path: activation quantization into Q8_K, mixed quantized dot products, SIMD dispatch, parity bugs, the mental model needed before writing AVX2/VNNI/NEON kernels, and the IR-level dtype audit.
Section 1: Why K-Quants Exist
Simple block quantization already works better than one scale for a whole tensor. Instead of one global scale, formats like Q4_0 use one FP16 scale for every 32 weights. That is the first important idea: preserve local dynamic range.
K-quants go one level deeper. They group 256 weights into a super-block and store nested metadata inside that super-block. The weights are still packed into low-bit integers, but the scale information becomes more structured. This lets the format reduce metadata overhead while preserving more local information than a naive single-scale block.
That is why K-quants matter for CPU inference. They reduce bytes moved from memory, but still preserve enough local detail that the model behaves like the original model. The price is that kernels become more complex: they must unpack the integers, unpack the scale hierarchy, apply correction terms, and accumulate with exactly the same semantics as the reference implementation.
The trade is not free. K-quants save memory bandwidth, but they spend instruction complexity. A CPU runtime wins only if the unpack/decode work is cheaper than moving full-precision weights through memory.
K-quant formats use 256-weight super-blocks. The kernel advances through the model in these layout units.
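To make the bandwidth side of that trade concrete, here is a minimal sketch that counts the bytes one weight row occupies as plain floats and as Q4_K super-blocks. The 144-byte Q4_K block size is the figure used later in this post. The helper names are hypothetical and are not CKE APIs.

```c
#include <assert.h>
#include <stddef.h>

#define QK_K 256              /* weights per K-quant super-block */
#define Q4_K_BLOCK_BYTES 144  /* 2 (d) + 2 (dmin) + 12 (scales) + 128 (qs) */

/* Bytes for one row of n weights stored as plain floats of elem_size bytes. */
static size_t row_bytes_float(size_t n, size_t elem_size) {
    return n * elem_size;
}

/* Bytes for one row of n weights stored as Q4_K super-blocks.
 * Hypothetical helper: assumes n is a multiple of 256. */
static size_t row_bytes_q4_k(size_t n) {
    assert(n % QK_K == 0);
    return (n / QK_K) * Q4_K_BLOCK_BYTES;
}

/* Effective bits per weight, metadata included. */
static double bits_per_weight_q4_k(void) {
    return 8.0 * Q4_K_BLOCK_BYTES / QK_K;
}
```

For an example 4096-weight row this gives 16384 bytes in FP32, 8192 bytes in FP16, and 2304 bytes in Q4_K, which is 4.5 bits per weight once the scale metadata is counted.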

Simple 32-weight format:
Q4_0:
d - one FP16 scale
qs[16] - 32 weights, 4 bits each
K-quant 256-weight format:
Q4_K:
d - super-block scale
dmin - super-block minimum scale
scales[12] - packed sub-block scales and mins
qs[128] - 256 weights, 4 bits each
Section 2: The Format Contract
C-Kernel-Engine keeps the K-quant format contract in include/ckernel_quant.h. This matters because the model loader, memory planner, scalar reference kernel, SIMD kernel, and parity tests must all agree on the same layout.
If one layer interprets scales[12] differently from another layer, the model may still run. It will simply produce wrong logits. That is the dangerous class of quantization bug: no crash, no obvious invalid pointer, just numerically plausible nonsense.
| Format | Super-block | Metadata | Payload | Decode idea |
|---|---|---|---|---|
| Q4_K | 256 weights | d, dmin, packed scales/mins | 128 bytes of 4-bit values | d × scale × q4 − dmin × min |
| Q5_K | 256 weights | similar K hierarchy plus high-bit storage | 4-bit low values plus high bit | more fidelity than Q4_K, more unpack work |
| Q6_K | 256 weights | 16 int8 scales + FP16 super-scale | ql[128] low bits + qh[64] high bits | d × scale × (q6 − 32) |
| Q8_K | 256 values | FP32 scale + block sums | 256 signed int8 values | activation-side bridge for mixed dots |
#define QK_K 256
#define K_SCALE_SIZE 12
typedef struct {
    ck_half d;                     /* super-block scale */
    ck_half dmin;                  /* super-block minimum */
    uint8_t scales[K_SCALE_SIZE];  /* 8 scales + 8 mins, 6-bit packed */
    uint8_t qs[QK_K / 2];          /* 256 x 4-bit weights */
} block_q4_K;
typedef struct {
    uint8_t ql[QK_K / 2];          /* low 4 bits */
    uint8_t qh[QK_K / 4];          /* high 2 bits */
    int8_t scales[QK_K / 16];      /* 16 sub-block scales */
    ck_half d;                     /* super-block scale */
} block_q6_K;
typedef struct {
    float d;                       /* activation scale */
    int8_t qs[QK_K];               /* 256 signed int8 values */
    int16_t bsums[QK_K / 16];      /* block sums for optimization */
} block_q8_K;
Section 3: Q4_K — The Format That Looks Simple Until You Decode It
Q4_K stores 256 weights in 144 bytes. The obvious part is qs[128]: each byte holds two 4-bit values. The non-obvious part is scales[12], which packs eight scale values and eight minimum values into 6-bit fields.
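The 144-byte figure can be checked mechanically. The sketch below restates the Q4_K struct from Section 2 and asserts its size and field offsets at compile time. It assumes ck_half is a 16-bit storage type (modeled here as uint16_t, which may differ from the real CKE typedef) and a C11 compiler for static_assert.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

#define QK_K 256
#define K_SCALE_SIZE 12

typedef uint16_t ck_half;  /* assumption: raw FP16 bits in 16-bit storage */

typedef struct {
    ck_half d;
    ck_half dmin;
    uint8_t scales[K_SCALE_SIZE];
    uint8_t qs[QK_K / 2];
} block_q4_K;

/* If any of these fail, some layer disagrees about the byte layout. */
static_assert(sizeof(block_q4_K) == 144, "Q4_K block must be 144 bytes");
static_assert(offsetof(block_q4_K, scales) == 4, "scales follow d and dmin");
static_assert(offsetof(block_q4_K, qs) == 16, "payload starts at byte 16");
```

A check like this costs nothing at runtime and turns a silent layout disagreement into a build failure.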
The decode is not simply weight = q × scale. It has a scale term and a minimum-correction term. In the CKE scalar reference kernel, the dot product computes integer products weighted by the unpacked sub-block scales, then subtracts the dmin correction using Q8_K block sums.

static inline void unpack_q4_k_scales(const uint8_t *scales,
                                      uint8_t *sc, uint8_t *m) {
    sc[0] = scales[0] & 0x3F;
    sc[1] = scales[1] & 0x3F;
    sc[2] = scales[2] & 0x3F;
    sc[3] = scales[3] & 0x3F;
    m[0] = scales[4] & 0x3F;
    m[1] = scales[5] & 0x3F;
    m[2] = scales[6] & 0x3F;
    m[3] = scales[7] & 0x3F;
    sc[4] = (scales[8] & 0x0F) | ((scales[0] >> 6) << 4);
    sc[5] = (scales[9] & 0x0F) | ((scales[1] >> 6) << 4);
    sc[6] = (scales[10] & 0x0F) | ((scales[2] >> 6) << 4);
    sc[7] = (scales[11] & 0x0F) | ((scales[3] >> 6) << 4);
    m[4] = (scales[8] >> 4) | ((scales[4] >> 6) << 4);
    m[5] = (scales[9] >> 4) | ((scales[5] >> 6) << 4);
    m[6] = (scales[10] >> 4) | ((scales[6] >> 6) << 4);
    m[7] = (scales[11] >> 4) | ((scales[7] >> 6) << 4);
}
The first real lesson of Q4_K is that the scale array is not an array of bytes. It is a 12-byte bit field. Treating it like ordinary metadata is how subtle parity bugs enter the runtime.
The entire Q4_K scale/min hierarchy for 256 weights is packed into only twelve bytes.
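One way to see that the 12 bytes really are a bit field is to write the inverse. The sketch below adds a hypothetical pack_q4_k_scales (not a CKE function; it is derived by inverting the unpack logic above) and repeats the unpack routine in loop form so the block is self-contained. Packing eight 6-bit scales and eight 6-bit mins and unpacking them again should return the original values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical inverse of unpack_q4_k_scales: 16 six-bit fields -> 12 bytes. */
static void pack_q4_k_scales(const uint8_t *sc, const uint8_t *m,
                             uint8_t *scales) {
    memset(scales, 0, 12);
    for (int j = 0; j < 4; ++j) {
        assert(sc[j] < 64 && m[j] < 64 && sc[j + 4] < 64 && m[j + 4] < 64);
        /* bytes 0..3: sc[0..3] in the low 6 bits, top 2 bits of sc[4..7] above */
        scales[j]     = (uint8_t)(sc[j] | ((sc[j + 4] >> 4) << 6));
        /* bytes 4..7: m[0..3] in the low 6 bits, top 2 bits of m[4..7] above */
        scales[j + 4] = (uint8_t)(m[j] | ((m[j + 4] >> 4) << 6));
        /* bytes 8..11: low nibbles of sc[4..7] and m[4..7] */
        scales[j + 8] = (uint8_t)((sc[j + 4] & 0x0F) | ((m[j + 4] & 0x0F) << 4));
    }
}

/* Same logic as the unpack routine above, written as a loop. */
static void unpack_q4_k_scales(const uint8_t *scales,
                               uint8_t *sc, uint8_t *m) {
    for (int j = 0; j < 4; ++j) {
        sc[j]     = scales[j] & 0x3F;
        m[j]      = scales[j + 4] & 0x3F;
        sc[j + 4] = (uint8_t)((scales[j + 8] & 0x0F) | ((scales[j] >> 6) << 4));
        m[j + 4]  = (uint8_t)((scales[j + 8] >> 4) | ((scales[j + 4] >> 6) << 4));
    }
}

/* Returns 1 if pack followed by unpack reproduces the inputs. */
static int q4_k_scales_round_trip(const uint8_t *sc, const uint8_t *m) {
    uint8_t packed[12], sc2[8], m2[8];
    pack_q4_k_scales(sc, m, packed);
    unpack_q4_k_scales(packed, sc2, m2);
    return memcmp(sc, sc2, 8) == 0 && memcmp(m, m2, 8) == 0;
}
```

A round-trip test of this kind is a cheap first parity gate: it fails immediately if a scalar or SIMD path swaps the roles of the upper two bits or the nibble halves.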
Section 4: Q5_K — The Middle Tier
Q5_K sits between Q4_K and Q6_K. The idea is straightforward: spend one more bit per weight than Q4-style storage so the quantized value can represent more levels. But that one extra bit does not arrive for free. It usually means high-bit packing, extra unpack logic, and additional places for SIMD paths to disagree with scalar reference code.
In practice, Q5_K is useful as a mental bridge. If Q4_K teaches “nibbles plus nested scale/min metadata,” and Q6_K teaches “low 4 bits plus high 2 bits and signed sub-scales,” then Q5_K is the intermediate form where the kernel writer starts to see why bit planes matter.
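As a concrete picture of that high-bit bookkeeping, the sketch below rebuilds 5-bit values for one 64-weight chunk from a nibble plane and a separate high-bit plane. The layout is an assumption: it follows the ggml-style Q5_K convention (32 payload bytes per 64 weights, plus a qh[32] array that contributes one bit per weight, selected by a mask that shifts per chunk). This post does not show CKE's Q5_K struct, so treat the function name and signature as hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild the 64 unsigned 5-bit values of chunk `chunk` (0..3) in a
 * 256-weight super-block.
 *   ql: 32 payload bytes for this chunk (two nibbles per byte)
 *   qh: 32 high-bit bytes shared by the whole super-block
 * Assumed order: low nibbles are weights 0..31 of the chunk, high nibbles
 * are weights 32..63; chunk c uses qh bits 2c and 2c + 1. */
static void q5_k_reconstruct_chunk(const uint8_t *ql, const uint8_t *qh,
                                   int chunk, uint8_t out[64]) {
    assert(chunk >= 0 && chunk < 4);
    const uint8_t u1 = (uint8_t)(1u << (2 * chunk));  /* bit for low nibbles  */
    const uint8_t u2 = (uint8_t)(2u << (2 * chunk));  /* bit for high nibbles */
    for (int l = 0; l < 32; ++l) {
        out[l]      = (uint8_t)((ql[l] & 0x0F) + ((qh[l] & u1) ? 16 : 0));
        out[l + 32] = (uint8_t)((ql[l] >> 4)   + ((qh[l] & u2) ? 16 : 0));
    }
}
```

The reconstructed values span 0 to 31. Assuming Q5_K shares the scale/min hierarchy the format table suggests, scaling and min correction then follow the same d × scale × q − dmin × min shape described for Q4_K.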
| Format | What improves | What becomes harder |
|---|---|---|
| Q4_K | Best compression among the K formats discussed here. | Scale/min correction and nibble unpacking. |
| Q5_K | More quantization levels than Q4_K. | High-bit handling and parity with mixed paths. |
| Q6_K | Better fidelity; cleaner signed reconstruction. | Low/high bit-plane extraction and more bytes moved. |
Q4_K:
4-bit payload
strong compression
scale/min correction is central
Q5_K:
5-bit payload
better value resolution
extra high-bit bookkeeping
Q6_K:
6-bit payload
higher fidelity
low 4 bits + high 2 bits + signed recentering
Section 5: Q6_K — Low Bits, High Bits, and the −32 Centering Step
Q6_K uses a different shape from Q4_K. It stores low 4 bits in ql[128], high 2 bits in qh[64], sixteen signed sub-block scales, and one FP16 super-block scale.
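The scalar reference code reconstructs a 6-bit unsigned value, then subtracts 32 to recenter it into the signed range −32 to 31. That detail is not cosmetic. If the kernel forgets the −32, every reconstructed weight is offset by +32 before scaling, and the dot product picks up a bias proportional to each sub-block scale times the sum of its activations.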

const int8_t q1 =
    (int8_t)((ql[l + 0] & 0xF) |
             (((qh[l] >> 0) & 3) << 4)) - 32;
const int8_t q2 =
    (int8_t)((ql[l + 32] & 0xF) |
             (((qh[l] >> 2) & 3) << 4)) - 32;
const int8_t q3 =
    (int8_t)((ql[l + 0] >> 4) |
             (((qh[l] >> 4) & 3) << 4)) - 32;
const int8_t q4 =
    (int8_t)((ql[l + 32] >> 4) |
             (((qh[l] >> 6) & 3) << 4)) - 32;
That extraction pattern is the kind of code that looks ugly because the data layout is optimized for compactness, not for human readability. The correct implementation is the one that agrees with the reference format.
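Wrapped in its loop, the extraction above reconstructs 128 signed values from 64 low-bit bytes and 32 high-bit bytes. The sketch below is a minimal standalone version of that loop. The function name is hypothetical, and the output ordering (q1 at l, q2 at l + 32, q3 at l + 64, q4 at l + 96) is assumed from the way the Q6_K × Q8_K excerpt in Section 8 indexes its activations.

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruct one 128-weight half of a Q6_K super-block.
 *   ql: 64 bytes of low nibbles, qh: 32 bytes of packed 2-bit high parts.
 * Each output is (6-bit unsigned value) - 32, so it lies in [-32, 31]. */
static void q6_k_reconstruct_half(const uint8_t *ql, const uint8_t *qh,
                                  int8_t out[128]) {
    for (int l = 0; l < 32; ++l) {
        const int q1 = (ql[l]      & 0xF) | (((qh[l] >> 0) & 3) << 4);
        const int q2 = (ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4);
        const int q3 = (ql[l]      >> 4)  | (((qh[l] >> 4) & 3) << 4);
        const int q4 = (ql[l + 32] >> 4)  | (((qh[l] >> 6) & 3) << 4);
        assert(q1 < 64 && q2 < 64 && q3 < 64 && q4 < 64);
        out[l]      = (int8_t)(q1 - 32);   /* recenter: 0..63 -> -32..31 */
        out[l + 32] = (int8_t)(q2 - 32);
        out[l + 64] = (int8_t)(q3 - 32);
        out[l + 96] = (int8_t)(q4 - 32);
    }
}
```

Note that an all-zero payload decodes to −32, not 0. That is exactly the offset a kernel silently drops if it skips the recentering step.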
Section 6: Q8_K — Why Activations Become Quantized Too
For decode-style inference, the model weights can stay quantized on disk and in memory. But the activation vector usually arrives as FP32 or BF16. To do a mixed quantized dot product efficiently, CKE quantizes that activation row into Q8_K blocks on the fly.
This is why Q8_K appears in kernels like gemv_q4_k_q8_k and gemv_q6_k_q8_k. The left side is a compressed weight row. The right side is a temporary quantized activation representation. The dot product can then accumulate integer products and apply floating scale terms at the block boundary.

void quantize_row_q8_k(const float *x, void *vy, int k) {
#if defined(__AVX512F__) && defined(__AVX512BW__)
    quantize_row_q8_k_avx512(x, vy, k);
#elif defined(__AVX2__)
    quantize_row_q8_k_avx2(x, vy, k);
#elif defined(__AVX__)
    quantize_row_q8_k_avx(x, vy, k);
#elif defined(__SSE4_1__)
    quantize_row_q8_k_sse(x, vy, k);
#else
    quantize_row_q8_k_ref(x, vy, k);
#endif
}
The activation quantizer is part of the inference hot path. If it is slow, the K-quant GEMV cannot win. If it is numerically different from the reference path, the optimized kernel inherits the error. A mixed quant kernel is only as trustworthy as both halves: the weight format decoder and the activation quantizer.
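For orientation, here is a simplified scalar sketch of what an activation quantizer of this shape has to produce: one float scale, 256 int8 values, and sixteen 16-value block sums per super-block. It is not CKE's quantize_row_q8_k_ref. The scale choice (largest absolute value mapped to 127) and the round-half-away-from-zero rule are assumptions for illustration, and a real reference may use a different sign convention or rounding rule, so this sketch should not serve as a parity oracle.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK_K 256

typedef struct {
    float   d;                 /* activation scale */
    int8_t  qs[QK_K];          /* quantized activations */
    int16_t bsums[QK_K / 16];  /* sum of each group of 16 qs values */
} block_q8_K;

/* Illustrative quantizer (hypothetical name): k must be a multiple of 256. */
static void quantize_row_q8_k_sketch(const float *x, block_q8_K *y, int k) {
    assert(k % QK_K == 0);
    for (int i = 0; i < k / QK_K; ++i, x += QK_K) {
        float amax = 0.0f;
        for (int j = 0; j < QK_K; ++j) {
            const float ax = (x[j] < 0.0f) ? -x[j] : x[j];
            if (ax > amax) amax = ax;
        }
        if (amax == 0.0f) {                  /* all-zero block */
            memset(&y[i], 0, sizeof(y[i]));
            continue;
        }
        const float d = amax / 127.0f;       /* x ~= d * q, with |q| <= 127 */
        y[i].d = d;
        for (int j = 0; j < QK_K; ++j) {
            const float v = x[j] / d;
            int q = (int)(v + (v >= 0.0f ? 0.5f : -0.5f));
            if (q > 127) q = 127;
            if (q < -127) q = -127;
            y[i].qs[j] = (int8_t)q;
        }
        for (int j = 0; j < QK_K / 16; ++j) {
            int sum = 0;
            for (int l = 0; l < 16; ++l) sum += y[i].qs[16 * j + l];
            y[i].bsums[j] = (int16_t)sum;    /* |sum| <= 16 * 127, fits int16 */
        }
    }
}
```

The block sums are computed once here, on the activation side, so that every weight row that consumes this activation can reuse them.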
Section 7: The Q4_K × Q8_K Dot Product
The Q4_K × Q8_K reference dot product shows the whole algorithm in compact form. First unpack the Q4_K scale/min fields. Then compute the product of packed Q4 values and Q8 activation values. Then apply scale terms. Then subtract the minimum correction using bsums.
The correction term is the part worth slowing down for. Because Q4_K has a min term, the dot product has to account for the contribution of that minimum across the activation block. Q8_K stores block sums partly to make that correction cheap.
uint8_t sc[8], m_val[8];
unpack_q4_k_scales(w[i].scales, sc, m_val);
const float d = CK_FP16_TO_FP32(w[i].d) * x[i].d;
const float dmin = CK_FP16_TO_FP32(w[i].dmin) * x[i].d;
int32_t aux32[8] = {0};
int sumi = 0;
/* min correction: two 16-value bsums per 32-weight sub-block, hence j / 2 */
for (int j = 0; j < QK_K / 16; ++j) {
    sumi += (int)x[i].bsums[j] * (int)m_val[j / 2];
}
/* elided: q4 nibble products, weighted by sc[], accumulate into aux32 */
/* elided: aux32_sum stands for the reduced aux32 lanes */
/* then apply scale and min correction */
sumf += d * aux32_sum;
sumf -= dmin * (float)sumi;
This is the real difference between “I understand 4-bit quantization” and “I understand Q4_K.” The first says “unpack nibbles and multiply by scale.” The second says “unpack nibbles, unpack scale/min metadata, use activation block sums, and subtract the correction term with exactly the right indexing.”
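The identity behind the correction term is worth seeing in code. If each weight decodes as d × sc × q − dmin × m and each activation as dx × q8, then the sum over a 32-weight sub-block splits into d × dx × sc × Σ(q × q8) minus dmin × dx × m × Σ(q8), and Σ(q8) is exactly what bsums stores. The sketch below computes one super-block both ways. To stay short it takes already-unpacked sc/m arrays and float scales, and it assumes a ggml-style nibble order (within each 64-weight chunk, low nibbles come first). All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define QK_K 256

/* Unsigned 4-bit weight at index i (0..255). Assumed order: each 64-weight
 * chunk uses 32 bytes, and its low nibbles are the chunk's first 32 weights. */
static int q4_at(const uint8_t *qs, int i) {
    assert(i >= 0 && i < QK_K);
    const uint8_t b = qs[32 * (i / 64) + (i % 32)];
    return ((i % 64) < 32) ? (b & 0x0F) : (b >> 4);
}

/* Integer path: nibble products weighted by sub-block scales, then one
 * min correction built from the Q8_K block sums. */
static float dot_q4k_q8k_block(float d, float dmin,
                               const uint8_t sc[8], const uint8_t m[8],
                               const uint8_t *qs, float dx,
                               const int8_t *q8, const int16_t *bsums) {
    int32_t sum_q = 0, sum_min = 0;
    for (int i = 0; i < QK_K; ++i)
        sum_q += (int32_t)sc[i / 32] * q4_at(qs, i) * q8[i];
    for (int j = 0; j < QK_K / 16; ++j)   /* two bsums per 32-weight sub-block */
        sum_min += (int32_t)bsums[j] * m[j / 2];
    return d * dx * (float)sum_q - dmin * dx * (float)sum_min;
}

/* Naive path: dequantize every weight and activation, then multiply. */
static float dot_dequant_block(float d, float dmin,
                               const uint8_t sc[8], const uint8_t m[8],
                               const uint8_t *qs, float dx, const int8_t *q8) {
    double acc = 0.0;
    for (int i = 0; i < QK_K; ++i) {
        const double w = (double)d * sc[i / 32] * q4_at(qs, i)
                       - (double)dmin * m[i / 32];
        acc += w * ((double)dx * q8[i]);
    }
    return (float)acc;
}
```

The two functions should agree up to floating-point rounding. The integer path touches floats only twice per super-block, which is why the format carries bsums at all.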
Section 8: The Q6_K × Q8_K Dot Product
Q6_K × Q8_K has a different hot loop. There is no dmin correction term like Q4_K. Instead, the code reconstructs signed 6-bit values, multiplies each by its signed sub-block scale, multiplies by the Q8 activation, and accumulates.
That makes the Q6_K mental model cleaner in one way: weight ≈ d × scale[sub] × (q6 − 32). But the bit extraction is more complex because each group of values is split across low and high bit planes.
aux32[l & 7] += (int)sc[is + 0] * (int)q1 * (int)q8[l + 0];
aux32[l & 7] += (int)sc[is + 2] * (int)q2 * (int)q8[l + 32];
aux32[l & 7] += (int)sc[is + 4] * (int)q3 * (int)q8[l + 64];
aux32[l & 7] += (int)sc[is + 6] * (int)q4 * (int)q8[l + 96];
for (int l = 0; l < 8; ++l) {
    sums[l] += d * (float)aux32[l];
}
When this is vectorized, the structure becomes: unpack bit planes, recenter values, multiply by scales, multiply by Q8 activations, accumulate in int32 lanes, then apply the floating block scale. The SIMD implementation can be completely different mechanically, but it must be identical semantically.
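Put together, the Q6_K side of a mixed dot product for one super-block can be sketched as below. This is a compact scalar illustration and not the CKE kernel: it keeps a single int32 accumulator instead of the eight aux32 lanes in the excerpt, and it takes the super-block scale d and the activation scale dx as already-converted floats. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define QK_K 256

/* sc: 16 signed sub-block scales, ql: 128 bytes, qh: 64 bytes, q8: 256 values. */
static float dot_q6k_q8k_block(float d, const int8_t *sc,
                               const uint8_t *ql, const uint8_t *qh,
                               float dx, const int8_t *q8) {
    assert(sc && ql && qh && q8);
    int32_t acc = 0;
    for (int n = 0; n < QK_K; n += 128) {        /* two 128-weight halves */
        for (int l = 0; l < 32; ++l) {
            const int is = l / 16;               /* 16 weights share one scale */
            const int q1 = ((ql[l]      & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
            const int q2 = ((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
            const int q3 = ((ql[l]      >> 4)  | (((qh[l] >> 4) & 3) << 4)) - 32;
            const int q4 = ((ql[l + 32] >> 4)  | (((qh[l] >> 6) & 3) << 4)) - 32;
            acc += sc[is + 0] * q1 * q8[l +  0];
            acc += sc[is + 2] * q2 * q8[l + 32];
            acc += sc[is + 4] * q3 * q8[l + 64];
            acc += sc[is + 6] * q4 * q8[l + 96];
        }
        ql += 64; qh += 32; sc += 8; q8 += 128;  /* advance to the next half */
    }
    return d * dx * (float)acc;                  /* one float scale per block */
}
```

Because there is no min term, bsums is not needed here. All of the integer work stays in the accumulator until the final float multiply.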
Section 9: Dispatch Is Part of Correctness
C-Kernel-Engine keeps scalar reference implementations because they are the oracle. Optimized paths exist for AVX, AVX2, AVX-512, VNNI, SSE, and NEON depending on the kernel. But dispatch is not only about speed. It is also about debug control.
For example, CK_DEBUG_Q8K_REF can force the reference Q8_K activation quantizer. CK_DEBUG_Q6K_Q8K_REF can force the Q6_K/Q8_K reference path. Those switches matter because when a model diverges, you need to isolate whether the error is in the format decode, activation quantization, dot product, or SIMD reduction.
CK_DEBUG_Q8K_REF=1
force scalar Q8_K activation quantization
CK_DEBUG_Q6K_Q8K_REF=1
force scalar Q6_K × Q8_K dot product
Purpose:
compare scalar vs SIMD
isolate quantizer bugs from dot-product bugs
protect model bring-up with parity gates
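A minimal sketch of how such a switch could gate dispatch is shown below. It assumes the CK_DEBUG_* switches are read as environment variables at runtime, which this post does not confirm (they could equally be compile-time macros). The function-pointer type, the stub quantizers, and the helper names are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef void (*quantize_fn)(const float *x, void *vy, int k);

/* Stand-ins for the real kernels so the sketch compiles on its own. */
static void quantize_ref_stub(const float *x, void *vy, int k)  { (void)x; (void)vy; (void)k; }
static void quantize_simd_stub(const float *x, void *vy, int k) { (void)x; (void)vy; (void)k; }

/* Treat unset, empty, and "0" as off; anything else as on. */
static int ck_flag_enabled(const char *value) {
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Pick the quantizer: the debug switch always wins over the SIMD path. */
static quantize_fn select_q8k_quantizer(const char *debug_value, int have_simd) {
    assert(have_simd == 0 || have_simd == 1);
    if (ck_flag_enabled(debug_value) || !have_simd)
        return quantize_ref_stub;
    return quantize_simd_stub;
}

/* Assumed usage: read the switch from the environment. */
static quantize_fn select_q8k_quantizer_from_env(int have_simd) {
    return select_q8k_quantizer(getenv("CK_DEBUG_Q8K_REF"), have_simd);
}
```

Keeping the flag parsing separate from the environment lookup makes the selection logic testable without touching process state.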
Section 10: The Kernel Writer’s Checklist
Before writing an optimized K-quant kernel, the checklist is mechanical. This is the exact reason these posts help harden C-Kernel-Engine: if the explanation cannot survive the checklist, the kernel probably cannot either.
| Question | Why it matters |
|---|---|
| What is the block size? | All K-quant loops advance by 256-weight super-blocks. |
| Where are the low bits? | Nibble order changes the value reconstruction. |
| Where are the high bits? | Q5_K and Q6_K require separate high-bit handling. |
| How are scales packed? | Q4_K uses 6-bit packed scale/min metadata. |
| Is there a min correction? | Q4_K needs dmin × bsums correction. |
| Is the value centered? | Q6_K requires q6 − 32. |
| What is the activation representation? | Mixed dot products depend on Q8_K quantization. |
| What is the scalar oracle? | Every SIMD path must match the scalar reference before benchmarking. |
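The last row of the checklist can be turned into a small gate. The sketch below compares a candidate output against the scalar oracle using a combined absolute and relative tolerance. The function name and the tolerance policy are assumptions for illustration; whether a given kernel must match exactly or within a tolerance is a per-kernel decision.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first element where `got` departs from `ref`
 * by more than atol + rtol * |ref[i]|, or -1 if every element passes.
 * With atol = rtol = 0 this demands exact equality. NaN never passes. */
static long first_parity_mismatch(const float *ref, const float *got,
                                  size_t n, float atol, float rtol) {
    assert(atol >= 0.0f && rtol >= 0.0f);
    for (size_t i = 0; i < n; ++i) {
        const float r = ref[i], g = got[i];
        const float diff = (g > r) ? (g - r) : (r - g);
        const float mag  = (r < 0.0f) ? -r : r;
        if (!(diff <= atol + rtol * mag))   /* negated form also catches NaN */
            return (long)i;
    }
    return -1;
}
```

Returning the first failing index rather than a boolean points directly at the super-block or lane where the optimized path diverges.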

Section 11: Why This Matters for CPU-First AI
On CPU, quantization is not a side feature. It is the deployment boundary. If the weights are too large, the model does not fit cleanly. If the weights fit but the format is slow to decode, the runtime loses throughput. If the decode is fast but not bit-for-bit compatible with the reference, the model becomes untrustworthy.
K-quants sit exactly at that boundary. They are compact enough to make CPU inference practical, but structured enough to preserve useful model quality. That is why CKE needs to own them at the kernel level rather than treating them as an opaque format imported from somewhere else.
My practical rule: the format is not supported until I can explain the bytes, write the scalar path, write the optimized path, and compare both against a known-good model path. “It loads” is not enough. “It matches the scalar oracle and the model parity gate” is where support begins.
Section 12: How CKE Makes Quantization Inspectable
This is where the C-Kernel-Engine IR Visualizer becomes more than a dashboard. A model filename might say Q4_K_M, but the runtime still has to inspect the actual tensor dtypes inside the artifact. A real GGUF can mix q4_k, q5_0, q6_k, q8_0, fp16, and fp32 tensors across attention, MLP, embeddings, output heads, and normalization weights.
The practical rule is simple: CKE should not trust the model-level label as the kernel contract. It should read the weights, lower the graph, build the memory plan, and make the dtype choice visible before codegen and runtime dispatch. That is why the generated ir_report.html includes a Weight Dtype Audit surface.
In the Qwen2 audit I am using while hardening CKE, the same model artifact shows mixed dtype behavior across layers. Some rows use q5_0 attention projections, some keep selected tensors at q8_0, MLP down projections can appear as q6_k or q4_k, and layer norms stay in fp32. That is the entire point: quantization is not a single string. It is a per-tensor execution contract.
| Visualizer surface | What it proves | Why it matters for K-quants |
|---|---|---|
| Weight Dtype Audit | Shows the dtype for each major tensor by layer. | Prevents treating a mixed model as if every weight were the same format. |
| Full Chain | Connects loaded tensor metadata through IR lowering and codegen. | Lets the runtime verify that q4_k, q6_k, or q8_0 tensors reach the right kernel family. |
| Per-Layer Flow Graph | Shows how each layer consumes weights and activations. | Makes it easier to debug whether a parity failure belongs to attention, MLP, normalization, or output projection. |
| Run Hub | Indexes generated runs and links to each ir_report.html. | Turns many model experiments into a browsable audit ledger instead of scattered files. |
How to use the IR Visualizer as a CKE product surface
The operator workflow is intentionally simple: start from a model, ask CKE to convert and compile it, generate the visualizer, then open the generated run report or the run hub. The point is not only to run a model. The point is to make the model inspectable as an engineering artifact.
- Use the v7 or v8 runbook command with --generate-visualizer.
- Open the generated ir_report.html inside the run directory.
- Use the Weight Dtype Audit tab to inspect which tensors are q4_k, q6_k, q8_0, fp32, or another supported dtype.
- Use the Full Chain and Per-Layer Flow Graph views to connect model weights to IR lowering, memory planning, and kernel dispatch.
- Use the IR Hub when you have multiple model experiments and want one browsable ledger of generated runs.
For the full operator paths, see the v7 runbook and v8 inference runbook.
version/v7/scripts/cks-v7-run run \
hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
--context-len 1024 \
--force-compile \
--force-convert \
--generate-visualizer
# Then open the generated run report:
# $CK_CACHE_DIR/Qwen--Qwen3-0.6B-GGUF/ir_report.html
# or $HOME/.cache/ck-engine-v7/models/Qwen--Qwen3-0.6B-GGUF/ir_report.html

RUN=$HOME/.cache/ck-engine-v7/models/train/<run-name>
python3 version/v7/tools/open_ir_visualizer.py \
--generate \
--run "$RUN" \
--html-only \
--strict-run-artifacts \
--output "$RUN/ir_report.html"
python3 version/v7/tools/open_ir_hub.py --open

version/v8/scripts/cks-v8-run run \
hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
--context-len 1024 \
--force-convert \
--force-compile \
--generate-visualizer
python3 version/v8/tools/open_ir_hub_v8.py --open
That inspection loop is important for CPU-first AI because the hard bugs are rarely visible at the marketing label level. A Q4_K_M model may still preserve sensitive tensors at higher precision because model quality can collapse quickly if the wrong operation is quantized too aggressively. CKE therefore needs both: the low-level kernel knowledge from this post and the IR-level audit surface that proves which dtype each tensor actually uses.
This is also the marketing story for CKE as a product: it is not only a runtime that tries to emit tokens. It is a toolchain that converts a model into an inspectable CPU execution artifact: weights, dtypes, IR graph, memory layout, generated code, kernel dispatch, parity data, and run reports.
Related ShivasNotes posts
If this is your first time looking at CKE quantized kernels, these posts give the surrounding runtime and hardware context:
- What Is the C-Kernel-Engine?
- Quantization Deep Dive: How CPU Kernels Compress Weights And Preserve Accuracy
- SIMD Deep Dive: How AI Kernels Use SSE, AVX, AVX-512, And VNNI
- ARM NEON In C-Kernel-Engine: Real Quantized Kernels On ARM
- Threadpools And Memory Pools: Why CKE Needs Runtime Ownership For CPU AI Kernels
- Linux System Programming For AI Kernels
- Pipeline vs Tensor Parallelism: How CKE Splits AI Across CPU Nodes
Related C-Kernel-Engine docs
For the living CKE documentation behind this post, see the quantization overview, format-specific notes, and the operator pages that generate the visualizer reports:
Section 13: Summary
Q4_K teaches nested scale/min metadata and correction terms. Q5_K teaches the intermediate high-bit bookkeeping tier. Q6_K teaches low/high bit-plane reconstruction and signed recentering. Q8_K teaches why activation quantization is part of the mixed-dot hot path.
The lesson is not that one format is always best. The lesson is that each format is a contract. The byte layout, dequant math, activation path, SIMD implementation, IR dtype audit, and model-level validation must all agree. That is the work underneath CPU-first inference.
K-quant support =
byte layout
+ scale hierarchy
+ packed integer reconstruction
+ mixed Q8_K activation path
+ scalar oracle
+ SIMD parity
+ model-level validation