K-Quants Deep Dive: Q4_K, Q5_K, Q6_K, Q8_K And Mixed Dot Products

ML systems · Quantization internals

This ShivasNotes deep dive is the granular companion to the broader Quantization Deep Dive in this series. The earlier post explained why quantization matters for CPU inference. This one opens the K-quant formats themselves: Q4_K, Q5_K, Q6_K, Q8_K, and the mixed K × Q8_K dot products that C-Kernel-Engine has to execute correctly.

The thesis is simple: K-quants are not just smaller floats. They are byte layouts, scale hierarchies, packed nibbles, sub-block metadata, correction terms, activation quantization, and strict parity contracts. If you cannot explain the bytes, you do not really own the kernel. This is where quantization turns from “divide by scale and round” into systems engineering. The math is small. The bookkeeping is the work.

What this post covers

Sections 1 through 5 explain why K-quants use 256-weight super-blocks, where the format contract lives, and how Q4_K, Q5_K, and Q6_K store their values.

Sections 6 through 10 connect the formats to the actual C-Kernel-Engine runtime path: activation quantization into Q8_K, mixed quantized dot products, SIMD dispatch, parity bugs, and the mental model needed before writing AVX2/VNNI/NEON kernels.

Sections 11 through 13 cover why this matters for CPU-first AI, how CKE makes per-tensor dtypes inspectable, and a summary.

Section 1: Why K-Quants Exist

Simple block quantization already works better than one scale for a whole tensor. Instead of one global scale, formats like Q4_0 use one FP16 scale for every 32 weights. That is the first important idea: preserve local dynamic range.

K-quants go one level deeper. They group 256 weights into a super-block and store nested metadata inside that super-block. The weights are still packed into low-bit integers, but the scale information becomes more structured. This lets the format reduce metadata overhead while preserving more local information than a naive single-scale block.

That is why K-quants matter for CPU inference. They reduce bytes moved from memory, but still preserve enough local detail that the model behaves like the original model. The price is that kernels become more complex: they must unpack the integers, unpack the scale hierarchy, apply correction terms, and accumulate with exactly the same semantics as the reference implementation.

The trade is not free. K-quants save memory bandwidth, but they spend instruction complexity. A CPU runtime wins only if the unpack/decode work is cheaper than moving full-precision weights through memory. K-quant formats use 256-weight super-blocks, and the kernel advances through the model in these layout units.

Figure: K-quant super-block diagram showing simple 32-weight block quantization versus 256-weight K-quant super-blocks with nested metadata.

Simple block quantization versus K-quant super-blocks

Simple 32-weight format:
  Q4_0:
    d       - one FP16 scale
    qs[16]  - 32 weights, 4 bits each

K-quant 256-weight format:
  Q4_K:
    d          - super-block scale
    dmin       - super-block minimum scale
    scales[12] - packed sub-block scales and mins
    qs[128]    - 256 weights, 4 bits each
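
As a rough check on the storage cost, the field sizes listed above can be turned into bits per weight. This is a minimal sketch that assumes a 2-byte FP16 for d and dmin; the helper name bits_per_weight and the two byte-count constants are illustrative and not part of CKE.

```c
#include <stddef.h>

/* Hypothetical helper: storage cost of one block in bits per weight. */
static double bits_per_weight(size_t block_bytes, size_t weights_per_block) {
    return (double)(block_bytes * 8) / (double)weights_per_block;
}

/* Field sizes taken from the two layouts above, assuming FP16 = 2 bytes. */
enum {
    Q4_0_BYTES = 2 + 16,           /* d + qs[16]                       */
    Q4_K_BYTES = 2 + 2 + 12 + 128  /* d + dmin + scales[12] + qs[128]  */
};
```

Under these assumptions, bits_per_weight(Q4_0_BYTES, 32) and bits_per_weight(Q4_K_BYTES, 256) both come out to 4.5. The difference is what the metadata buys: Q4_0 gets one scale per 32 weights, while Q4_K gets a 6-bit scale and a 6-bit min for each 32-weight sub-block, plus two FP16 super-block factors.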

Section 2: The Format Contract

C-Kernel-Engine keeps the K-quant format contract in include/ckernel_quant.h. This matters because the model loader, memory planner, scalar reference kernel, SIMD kernel, and parity tests must all agree on the same layout.

If one of these components interprets scales[12] differently from another, the model may still run. It will simply produce wrong logits. That is the dangerous class of quantization bug: no crash, no obvious invalid pointer, just numerically plausible nonsense.

Format	Super-block	Metadata	Payload	Decode idea
`Q4_K`	256 weights	`d`, `dmin`, packed scales/mins	128 bytes of 4-bit values	`d × scale × q4 − dmin × min`
`Q5_K`	256 weights	similar K hierarchy plus high-bit storage	4-bit low values plus high bit	more fidelity than Q4_K, more unpack work
`Q6_K`	256 weights	16 int8 scales + FP16 super-scale	`ql[128]` low bits + `qh[64]` high bits	`d × scale × (q6 − 32)`
`Q8_K`	256 values	FP32 scale + block sums	256 signed int8 values	activation-side bridge for mixed dots

Core K-quant structs in C-Kernel-Engine

#define QK_K 256
#define K_SCALE_SIZE 12

typedef struct {
    ck_half d;                    /* super-block scale */
    ck_half dmin;                 /* super-block minimum */
    uint8_t scales[K_SCALE_SIZE]; /* 8 scales + 8 mins, 6-bit packed */
    uint8_t qs[QK_K / 2];         /* 256 x 4-bit weights */
} block_q4_K;

typedef struct {
    uint8_t ql[QK_K / 2];      /* low 4 bits */
    uint8_t qh[QK_K / 4];      /* high 2 bits */
    int8_t scales[QK_K / 16];  /* 16 sub-block scales */
    ck_half d;                 /* super-block scale */
} block_q6_K;

typedef struct {
    float d;                  /* activation scale */
    int8_t qs[QK_K];          /* 256 signed int8 values */
    int16_t bsums[QK_K / 16]; /* block sums for optimization */
} block_q8_K;
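
One cheap way to keep every component honest about these layouts is to pin the sizes at compile time. The sketch below repeats the three structs and assumes ck_half is a raw 16-bit storage type (the real typedef in ckernel_quant.h may differ); the expected sizes are simply the sums of the field sizes.

```c
#include <stddef.h>
#include <stdint.h>

#define QK_K 256
#define K_SCALE_SIZE 12

typedef uint16_t ck_half; /* assumption: FP16 kept as a raw 16-bit pattern */

typedef struct {
    ck_half d;
    ck_half dmin;
    uint8_t scales[K_SCALE_SIZE];
    uint8_t qs[QK_K / 2];
} block_q4_K;

typedef struct {
    uint8_t ql[QK_K / 2];
    uint8_t qh[QK_K / 4];
    int8_t  scales[QK_K / 16];
    ck_half d;
} block_q6_K;

typedef struct {
    float   d;
    int8_t  qs[QK_K];
    int16_t bsums[QK_K / 16];
} block_q8_K;

/* 2 + 2 + 12 + 128 */
_Static_assert(sizeof(block_q4_K) == 144, "Q4_K super-block must be 144 bytes");
/* 128 + 64 + 16 + 2 */
_Static_assert(sizeof(block_q6_K) == 210, "Q6_K super-block must be 210 bytes");
/* 4 + 256 + 32 */
_Static_assert(sizeof(block_q8_K) == 292, "Q8_K block must be 292 bytes");
/* The 4-bit payload starts right after d, dmin, and the 12 scale bytes. */
_Static_assert(offsetof(block_q4_K, qs) == 16, "Q4_K payload offset changed");
```

If a compiler inserts padding or someone reorders a field, the build fails instead of the logits quietly drifting.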

Section 3: Q4_K — The Format That Looks Simple Until You Decode It

Q4_K stores 256 weights in 144 bytes. The obvious part is qs[128]: each byte holds two 4-bit values. The non-obvious part is scales[12], which packs eight scale values and eight minimum values into 6-bit fields.

The decode is not simply weight = q × scale. It has a scale term and a minimum-correction term. In the CKE scalar reference kernel, the dot product computes integer products weighted by the unpacked sub-block scales, then subtracts the dmin correction using Q8_K block sums.
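
To make the two-term decode concrete, here is a small sketch that dequantizes one 64-weight chunk (32 payload bytes) of a Q4_K super-block. It assumes the ggml-style nibble order, where the low nibbles of the 32 bytes hold the first 32 weights and the high nibbles hold the next 32; the function name and parameter list are illustrative, and the scale and min arguments are expected to be already unpacked from scales[12].

```c
#include <stdint.h>

/* Decode one 64-weight chunk of a Q4_K super-block from 32 payload bytes.
 * Assumption (ggml-style nibble order): low nibbles are the first 32 weights
 * of the chunk, high nibbles are the next 32.
 * d_f32 / dmin_f32: FP16 super-block factors already converted to float.
 * sc_lo/m_lo and sc_hi/m_hi: unpacked 6-bit scale and min of the two
 * 32-weight sub-blocks covered by this chunk. */
static void dequant_q4_k_chunk(const uint8_t *qs, float d_f32, float dmin_f32,
                               uint8_t sc_lo, uint8_t m_lo,
                               uint8_t sc_hi, uint8_t m_hi,
                               float *out /* 64 floats */) {
    const float d1   = d_f32 * (float)sc_lo;
    const float min1 = dmin_f32 * (float)m_lo;
    const float d2   = d_f32 * (float)sc_hi;
    const float min2 = dmin_f32 * (float)m_hi;
    for (int l = 0; l < 32; ++l) {
        out[l]      = d1 * (float)(qs[l] & 0x0F) - min1; /* scale term - min term */
        out[l + 32] = d2 * (float)(qs[l] >> 4)   - min2;
    }
}
```

For example, with every payload byte equal to 0x5A, d = 0.5, dmin = 0.25, (sc_lo, m_lo) = (2, 4), and (sc_hi, m_hi) = (3, 8), the low-nibble weights decode to 0.5 × 2 × 10 − 0.25 × 4 = 9.0 and the high-nibble weights to 0.5 × 3 × 5 − 0.25 × 8 = 5.5.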

Figure: Q4_K layout with d, dmin, packed 12-byte scales field, and 128-byte packed 4-bit weight payload.

Q4_K scale/min unpacking contract

static inline void unpack_q4_k_scales(const uint8_t *scales,
                                      uint8_t *sc, uint8_t *m) {
    sc[0] = scales[0] & 0x3F;
    sc[1] = scales[1] & 0x3F;
    sc[2] = scales[2] & 0x3F;
    sc[3] = scales[3] & 0x3F;

    m[0] = scales[4] & 0x3F;
    m[1] = scales[5] & 0x3F;
    m[2] = scales[6] & 0x3F;
    m[3] = scales[7] & 0x3F;

    sc[4] = (scales[8]  & 0x0F) | ((scales[0] >> 6) << 4);
    sc[5] = (scales[9]  & 0x0F) | ((scales[1] >> 6) << 4);
    sc[6] = (scales[10] & 0x0F) | ((scales[2] >> 6) << 4);
    sc[7] = (scales[11] & 0x0F) | ((scales[3] >> 6) << 4);

    m[4] = (scales[8]  >> 4) | ((scales[4] >> 6) << 4);
    m[5] = (scales[9]  >> 4) | ((scales[5] >> 6) << 4);
    m[6] = (scales[10] >> 4) | ((scales[6] >> 6) << 4);
    m[7] = (scales[11] >> 4) | ((scales[7] >> 6) << 4);
}
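
A round-trip test is a simple way to check that the bit field is understood. The sketch below adds a hypothetical pack_q4_k_scales, written as the inverse of the unpack routine above, next to a loop-form copy of the unpacker so the block stands alone. The packer is my own illustration and is not taken from CKE.

```c
#include <stdint.h>

/* Same contract as the unpack routine above, written as a loop. */
static void unpack_q4_k_scales(const uint8_t *scales, uint8_t *sc, uint8_t *m) {
    for (int j = 0; j < 4; ++j) {
        sc[j]     = scales[j]     & 0x3F;
        m[j]      = scales[j + 4] & 0x3F;
        sc[j + 4] = (uint8_t)((scales[j + 8] & 0x0F) | ((scales[j]     >> 6) << 4));
        m[j + 4]  = (uint8_t)((scales[j + 8] >> 4)   | ((scales[j + 4] >> 6) << 4));
    }
}

/* Hypothetical inverse: pack eight 6-bit scales and eight 6-bit mins
 * (each in 0..63) into the 12-byte field. */
static void pack_q4_k_scales(const uint8_t *sc, const uint8_t *m, uint8_t *scales) {
    for (int j = 0; j < 4; ++j) {
        /* bytes 0..3: low 6 bits = sc[j], top 2 bits = high 2 bits of sc[j+4] */
        scales[j]     = (uint8_t)((sc[j] & 0x3F) | ((sc[j + 4] >> 4) << 6));
        /* bytes 4..7: low 6 bits = m[j], top 2 bits = high 2 bits of m[j+4] */
        scales[j + 4] = (uint8_t)((m[j] & 0x3F) | ((m[j + 4] >> 4) << 6));
        /* bytes 8..11: low nibble = low 4 bits of sc[j+4],
         *              high nibble = low 4 bits of m[j+4] */
        scales[j + 8] = (uint8_t)((sc[j + 4] & 0x0F) | ((m[j + 4] & 0x0F) << 4));
    }
}
```

Packing any sixteen values in 0..63 and unpacking them again should return the same values. If it does not, the reader of the field and the writer of the field disagree about the contract.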

The first real lesson of Q4_K is that the scale array is not an array of bytes. It is a 12-byte bit field. Treating it like ordinary metadata is how subtle parity bugs enter the runtime. The entire Q4_K scale/min hierarchy for 256 weights is packed into only twelve bytes.

Section 4: Q5_K — The Middle Tier

Q5_K sits between Q4_K and Q6_K. The idea is straightforward: spend one more bit per weight than Q4-style storage so the quantized value can represent more levels. But that one extra bit does not arrive for free. It usually means high-bit packing, extra unpack logic, and additional places for SIMD paths to disagree with scalar reference code.

In practice, Q5_K is useful as a mental bridge. If Q4_K teaches “nibbles plus nested scale/min metadata,” and Q6_K teaches “low 4 bits plus high 2 bits and signed sub-scales,” then Q5_K is the intermediate form where the kernel writer starts to see why bit planes matter.

Format	What improves	What becomes harder
`Q4_K`	Best compression among the K formats discussed here.	Scale/min correction and nibble unpacking.
`Q5_K`	More quantization levels than Q4_K.	High-bit handling and parity with mixed paths.
`Q6_K`	Better fidelity; cleaner signed reconstruction.	Low/high bit-plane extraction and more bytes moved.

How to think about the Q5_K tier

Q4_K:
  4-bit payload
  strong compression
  scale/min correction is central

Q5_K:
  5-bit payload
  better value resolution
  extra high-bit bookkeeping

Q6_K:
  6-bit payload
  higher fidelity
  low 4 bits + high 2 bits + signed recentering
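
The "extra high-bit bookkeeping" can be sketched in isolation. This post does not show the Q5_K struct, so the helper below is a generic illustration: it only assumes that each weight has a 4-bit low part and a single high bit stored in a separate byte. The name q5_combine and the bit_index parameter are hypothetical, and which byte and bit belong to which weight depends on the real layout.

```c
#include <stdint.h>

/* Hypothetical helper: rebuild an unsigned 5-bit value (0..31) from a 4-bit
 * low part and one bit taken from a separate high-bit byte.
 * bit_index selects which bit of high_bits belongs to this weight. */
static inline uint8_t q5_combine(uint8_t low_nibble, uint8_t high_bits, int bit_index) {
    const uint8_t hi = (uint8_t)((high_bits >> bit_index) & 1u);
    return (uint8_t)((low_nibble & 0x0F) | (hi << 4));
}
```

The point of the sketch is the failure mode: picking the wrong bit_index still yields a valid 5-bit number, just one that is off by 16.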

Section 5: Q6_K — Low Bits, High Bits, and the −32 Centering Step

Q6_K uses a different shape from Q4_K. It stores low 4 bits in ql[128], high 2 bits in qh[64], sixteen signed sub-block scales, and one FP16 super-block scale.

The scalar reference code reconstructs a 6-bit unsigned value, then subtracts 32 to recenter it into a signed range. That detail is not cosmetic. If the kernel forgets the −32, every reconstructed weight is shifted up by 32 quantization steps, and the dot product picks up a large systematic error.

Figure: Q6_K layout showing ql low 4-bit plane, qh high 2-bit plane, sixteen int8 scales, and FP16 super-block scale.

Q6_K scalar reconstruction pattern

const int8_t q1 =
    (int8_t)((ql[l + 0] & 0xF) |
    (((qh[l] >> 0) & 3) << 4)) - 32;

const int8_t q2 =
    (int8_t)((ql[l + 32] & 0xF) |
    (((qh[l] >> 2) & 3) << 4)) - 32;

const int8_t q3 =
    (int8_t)((ql[l + 0] >> 4) |
    (((qh[l] >> 4) & 3) << 4)) - 32;

const int8_t q4 =
    (int8_t)((ql[l + 32] >> 4) |
    (((qh[l] >> 6) & 3) << 4)) - 32;

That extraction pattern is the kind of code that looks ugly because the data layout is optimized for compactness, not for human readability. The correct implementation is the one that agrees with the reference format.
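
A worked example with three concrete bytes makes the pattern easier to check by hand. The wrapper below follows the extraction shown above; the function name q6_k_extract4 is illustrative, and it assumes ql and qh point at the start of one 128-weight half of a super-block (64 low-nibble bytes and 32 high-bit bytes), as in the ggml-style reference.

```c
#include <stdint.h>

/* Rebuild the four signed 6-bit values that share index l.
 * Assumption: ql/qh point at one 128-weight half of a super-block. */
static void q6_k_extract4(const uint8_t *ql, const uint8_t *qh, int l, int8_t out[4]) {
    out[0] = (int8_t)(((ql[l]      & 0x0F) | (((qh[l] >> 0) & 3) << 4)) - 32);
    out[1] = (int8_t)(((ql[l + 32] & 0x0F) | (((qh[l] >> 2) & 3) << 4)) - 32);
    out[2] = (int8_t)(((ql[l]      >> 4)   | (((qh[l] >> 4) & 3) << 4)) - 32);
    out[3] = (int8_t)(((ql[l + 32] >> 4)   | (((qh[l] >> 6) & 3) << 4)) - 32);
}
```

With ql[0] = 0x21, ql[32] = 0x43, and qh[0] = 0xE4, the unsigned 6-bit values are 1, 19, 34, and 52, so the recentered results are −31, −13, 2, and 20. One qh byte feeds all four values, two bits at a time.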

Section 6: Q8_K — Why Activations Become Quantized Too

For decode-style inference, the model weights can stay quantized on disk and in memory. But the activation vector usually arrives as FP32 or BF16. To do a mixed quantized dot product efficiently, CKE quantizes that activation row into Q8_K blocks on the fly.

This is why Q8_K appears in kernels like gemv_q4_k_q8_k and gemv_q6_k_q8_k. The left side is a compressed weight row. The right side is a temporary quantized activation representation. The dot product can then accumulate integer products and apply floating scale terms at the block boundary.

Figure: Mixed quantized dot path where Q4_K or Q6_K weights multiply Q8_K activations.

Q8_K activation quantization path

void quantize_row_q8_k(const float *x, void *vy, int k) {
#if defined(__AVX512F__) && defined(__AVX512BW__)
    quantize_row_q8_k_avx512(x, vy, k);
#elif defined(__AVX2__)
    quantize_row_q8_k_avx2(x, vy, k);
#elif defined(__AVX__)
    quantize_row_q8_k_avx(x, vy, k);
#elif defined(__SSE4_1__)
    quantize_row_q8_k_sse(x, vy, k);
#else
    quantize_row_q8_k_ref(x, vy, k);
#endif
}

The activation quantizer is part of the inference hot path. If it is slow, the K-quant GEMV cannot win. If it is numerically different from the reference path, the optimized kernel inherits the error. A mixed quant kernel is only as trustworthy as both halves: the weight format decoder and the activation quantizer.
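
For orientation, here is a simplified sketch of what a scalar Q8_K quantizer does for one 256-value block. It is not the CKE reference: it assumes symmetric absmax scaling to [-127, 127] and round-half-away-from-zero, and a real reference may choose the scale sign and rounding mode differently. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define QK_K 256

typedef struct {
    float   d;
    int8_t  qs[QK_K];
    int16_t bsums[QK_K / 16];
} block_q8_K;

/* Simplified sketch of a scalar Q8_K quantizer for ONE 256-value block.
 * Assumptions: symmetric absmax scaling and round-half-away rounding. */
static void quantize_block_q8_k_sketch(const float *x, block_q8_K *y) {
    float amax = 0.0f;
    for (int j = 0; j < QK_K; ++j) {
        const float ax = x[j] < 0.0f ? -x[j] : x[j];
        if (ax > amax) amax = ax;
    }
    if (amax == 0.0f) {                 /* all-zero block: nothing to scale */
        memset(y, 0, sizeof(*y));
        return;
    }
    const float inv = 127.0f / amax;
    for (int j = 0; j < QK_K; ++j) {
        const float v = x[j] * inv;
        int q = (int)(v + (v >= 0.0f ? 0.5f : -0.5f)); /* round half away from zero */
        if (q > 127) q = 127;
        if (q < -127) q = -127;
        y->qs[j] = (int8_t)q;
    }
    for (int b = 0; b < QK_K / 16; ++b) { /* one sum per 16 quantized values */
        int sum = 0;
        for (int j = 0; j < 16; ++j) sum += y->qs[b * 16 + j];
        y->bsums[b] = (int16_t)sum;
    }
    y->d = amax / 127.0f;               /* dequant: x[j] ~= d * qs[j] */
}
```

Note that bsums is computed from the quantized values, not from the float inputs. That is what lets the dot product reuse it for the min correction without touching the activations again.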

Section 7: The Q4_K × Q8_K Dot Product

The Q4_K × Q8_K reference dot product shows the whole algorithm in compact form. First unpack the Q4_K scale/min fields. Then compute the product of packed Q4 values and Q8 activation values. Then apply scale terms. Then subtract the minimum correction using bsums.

The correction term is the part worth slowing down for. Because Q4_K has a min term, the dot product has to account for the contribution of that minimum across the activation block. Q8_K stores block sums partly to make that correction cheap.
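
The reason the block sums help can be written out in a few lines. The notation here is mine: s(i) is the 32-weight sub-block that weight i belongs to, d_x is the Q8_K activation scale, and q^{(8)}_i is the quantized activation. The last line assumes each sub-block covers two consecutive 16-value bsums entries, which matches the m[j / 2] indexing in the reference shape.

```latex
\begin{aligned}
w_i &\approx d \cdot sc_{s(i)} \cdot q_i \;-\; d_{\min} \cdot m_{s(i)},
\qquad a_i \approx d_x \cdot q^{(8)}_i \\[4pt]
\sum_i w_i\, a_i &\approx d\, d_x \sum_{s=0}^{7} sc_s \sum_{i \in s} q_i\, q^{(8)}_i
\;-\; d_{\min}\, d_x \sum_{s=0}^{7} m_s \sum_{i \in s} q^{(8)}_i \\[4pt]
\sum_{i \in s} q^{(8)}_i &= \mathrm{bsums}[2s] + \mathrm{bsums}[2s+1]
\end{aligned}
```

The second sum in the middle line depends only on the activations, so it can be computed once during activation quantization and reused for every weight row.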

Q4_K × Q8_K scalar reference shape

uint8_t sc[8], m_val[8];
unpack_q4_k_scales(w[i].scales, sc, m_val);

const float d = CK_FP16_TO_FP32(w[i].d) * x[i].d;
const float dmin = CK_FP16_TO_FP32(w[i].dmin) * x[i].d;

int32_t aux32[8] = {0};
int sumi = 0;
for (int j = 0; j < QK_K / 16; ++j) {
    sumi += (int)x[i].bsums[j] * (int)m_val[j / 2];
}

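/* q4 nibble products, each weighted by its sub-block scale sc[],
   accumulate into aux32 (that loop is elided in this excerpt) */
int32_t aux32_sum = 0;
for (int l = 0; l < 8; ++l) {
    aux32_sum += aux32[l];
}

/* then apply scale and min correction */
sumf += d * (float)aux32_sum;
sumf -= dmin * (float)sumi;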

This is the real difference between “I understand 4-bit quantization” and “I understand Q4_K.” The first says “unpack nibbles and multiply by scale.” The second says “unpack nibbles, unpack scale/min metadata, use activation block sums, and subtract the correction term with exactly the right indexing.”
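
The indexing claim is easy to test in isolation. The sketch below computes the min-correction sum two ways: naively over all 256 activations, and from the 16 block sums with the m[j / 2] indexing used in the reference shape. Both function names are illustrative, and the sketch assumes sub-blocks are contiguous runs of 32 activations.

```c
#include <stdint.h>

#define QK_K 256

/* Naive min correction: walk all 256 activations and weight each one by the
 * min of the 32-weight sub-block it belongs to. */
static int min_correction_naive(const int8_t *q8, const uint8_t *m) {
    int sum = 0;
    for (int i = 0; i < QK_K; ++i) {
        sum += (int)q8[i] * (int)m[i / 32];
    }
    return sum;
}

/* Same quantity from 16 precomputed block sums (16 activations each):
 * two consecutive bsums entries share one sub-block min, hence m[j / 2]. */
static int min_correction_bsums(const int16_t *bsums, const uint8_t *m) {
    int sum = 0;
    for (int j = 0; j < QK_K / 16; ++j) {
        sum += (int)bsums[j] * (int)m[j / 2];
    }
    return sum;
}
```

If a kernel uses m[j] or m[j / 4] instead, it still compiles and still produces a number. Only a comparison like this one catches it.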

Section 8: The Q6_K × Q8_K Dot Product

Q6_K × Q8_K has a different hot loop. Unlike Q4_K, it has no dmin correction term. Instead, the code reconstructs signed 6-bit values, multiplies each by its signed sub-block scale, multiplies by the Q8 activation, and accumulates.

That makes the Q6_K mental model cleaner in one way: weight ≈ d × scale[sub] × (q6 − 32). But the bit extraction is more complex because each group of values is split across low and high bit planes.

Q6_K × Q8_K reference accumulation

aux32[l & 7] += (int)sc[is + 0] * (int)q1 * (int)q8[l + 0];
aux32[l & 7] += (int)sc[is + 2] * (int)q2 * (int)q8[l + 32];
aux32[l & 7] += (int)sc[is + 4] * (int)q3 * (int)q8[l + 64];
aux32[l & 7] += (int)sc[is + 6] * (int)q4 * (int)q8[l + 96];

for (int l = 0; l < 8; ++l) {
    sums[l] += d * (float)aux32[l];
}

When this is vectorized, the structure becomes: unpack bit planes, recenter values, multiply by scales, multiply by Q8 activations, accumulate in int32 lanes, then apply the floating block scale. The SIMD implementation can be completely different mechanically, but it must be identical semantically.
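
As a reference point for that structure, here is a sketch of the integer part of the dot product over one 128-weight half of a Q6_K super-block. It assumes is = l / 16 as the scale index, as in the ggml-style reference, and that sc points at the eight scales of this half; the function name is hypothetical. The caller would multiply the result by the weight d and the activation d.

```c
#include <stdint.h>

/* Integer part of a Q6_K x Q8_K dot product over one 128-weight half.
 * ql: 64 bytes, qh: 32 bytes, sc: the 8 int8 scales of this half,
 * q8: 128 quantized activations.
 * Assumption: is = l / 16 selects the scale within each 32-weight stripe. */
static int32_t q6_k_q8_k_half_dot(const uint8_t *ql, const uint8_t *qh,
                                  const int8_t *sc, const int8_t *q8) {
    int32_t acc = 0;
    for (int l = 0; l < 32; ++l) {
        const int is = l / 16;
        /* unpack bit planes and recenter by -32 */
        const int q1 = (int)((ql[l]      & 0x0F) | (((qh[l] >> 0) & 3) << 4)) - 32;
        const int q2 = (int)((ql[l + 32] & 0x0F) | (((qh[l] >> 2) & 3) << 4)) - 32;
        const int q3 = (int)((ql[l]      >> 4)   | (((qh[l] >> 4) & 3) << 4)) - 32;
        const int q4 = (int)((ql[l + 32] >> 4)   | (((qh[l] >> 6) & 3) << 4)) - 32;
        /* scale * weight * activation, accumulated in int32 */
        acc += (int)sc[is + 0] * q1 * (int)q8[l];
        acc += (int)sc[is + 2] * q2 * (int)q8[l + 32];
        acc += (int)sc[is + 4] * q3 * (int)q8[l + 64];
        acc += (int)sc[is + 6] * q4 * (int)q8[l + 96];
    }
    return acc;
}
```

A SIMD version can reorder all of this, but for the same inputs its integer sum should match this value exactly, because integer addition does not depend on accumulation order.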

Section 9: Dispatch Is Part of Correctness

C-Kernel-Engine keeps scalar reference implementations because they are the oracle. Optimized paths exist for AVX, AVX2, AVX-512, VNNI, SSE, and NEON depending on the kernel. But dispatch is not only about speed. It is also about debug control.

For example, CK_DEBUG_Q8K_REF can force the reference Q8_K activation quantizer. CK_DEBUG_Q6K_Q8K_REF can force the Q6_K/Q8_K reference path. Those switches matter because when a model diverges, you need to isolate whether the error is in the format decode, activation quantization, dot product, or SIMD reduction.

Debug switches make the optimized path falsifiable

CK_DEBUG_Q8K_REF=1
  force scalar Q8_K activation quantization

CK_DEBUG_Q6K_Q8K_REF=1
  force scalar Q6_K × Q8_K dot product

Purpose:
  compare scalar vs SIMD
  isolate quantizer bugs from dot-product bugs
  protect model bring-up with parity gates
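
One way such a switch could be wired is sketched below. This is an assumption on my part, not CKE's actual mechanism: it treats CK_DEBUG_Q8K_REF as an environment variable read at startup, and the helper names ck_flag_is_set and select_q8_k_quantizer are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: interpret a debug switch value.
 * NULL, "" and "0" mean off; anything else means on. */
static int ck_flag_is_set(const char *value) {
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

typedef void (*quantize_row_fn)(const float *x, void *vy, int k);

/* Pick the quantizer once at startup. The reference path wins whenever the
 * debug switch is set, regardless of which SIMD path was compiled in.
 * Assumption: the switch is read from the process environment. */
static quantize_row_fn select_q8_k_quantizer(quantize_row_fn ref_fn,
                                             quantize_row_fn simd_fn) {
    if (simd_fn == NULL || ck_flag_is_set(getenv("CK_DEBUG_Q8K_REF"))) {
        return ref_fn;
    }
    return simd_fn;
}
```

Resolving the choice once, rather than per call, keeps the hot path free of environment lookups and makes it obvious in logs which path a run actually used.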

Common K-quant bug surfaces: nibble order, scale unpacking, dmin correction, recentering, block stride, and SIMD parity.

Section 10: The Kernel Writer’s Checklist

Before writing an optimized K-quant kernel, the checklist is mechanical. This is the exact reason these posts help harden C-Kernel-Engine: if the explanation cannot survive the checklist, the kernel probably cannot either.

Question	Why it matters
What is the block size?	All K-quant loops advance by 256-weight super-blocks.
Where are the low bits?	Nibble order changes the value reconstruction.
Where are the high bits?	Q5_K and Q6_K require separate high-bit handling.
How are scales packed?	Q4_K uses 6-bit packed scale/min metadata.
Is there a min correction?	Q4_K needs `dmin × bsums` correction.
Is the value centered?	Q6_K requires `q6 − 32`.
What is the activation representation?	Mixed dot products depend on `Q8_K` quantization.
What is the scalar oracle?	Every SIMD path must match the scalar reference before benchmarking.
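
The last row of the checklist can be turned into a small gate. The sketch below is a hypothetical helper, not CKE's parity harness: it reports the first output element where an optimized kernel disagrees with the scalar oracle by more than a tolerance, and it treats NaN as a mismatch.

```c
#include <stddef.h>

/* Returns the index of the first element where |ref - opt| exceeds tol
 * (NaN counts as a mismatch), or -1 if the whole vector is within tol. */
static long first_parity_mismatch(const float *ref, const float *opt,
                                  size_t n, float tol) {
    for (size_t i = 0; i < n; ++i) {
        const float diff = ref[i] > opt[i] ? ref[i] - opt[i] : opt[i] - ref[i];
        if (!(diff <= tol)) {   /* written this way so NaN fails the gate */
            return (long)i;
        }
    }
    return -1;
}
```

Returning the index rather than a boolean matters during bring-up: the position of the first mismatch often points at a specific super-block or row, which narrows the search to one stride or one unpack step.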

Figure: Format trade-off curve showing approximate bytes per weight versus fidelity intuition for Q4_0, Q4_K, Q5_K, Q6_K, and Q8_K.

Section 11: Why This Matters for CPU-First AI

On CPU, quantization is not a side feature. It is the deployment boundary. If the weights are too large, the model does not fit cleanly. If the weights fit but the format is slow to decode, the runtime loses throughput. If the decode is fast but not bit-for-bit compatible with the reference, the model becomes untrustworthy.

K-quants sit exactly at that boundary. They are compact enough to make CPU inference practical, but structured enough to preserve useful model quality. That is why CKE needs to own them at the kernel level rather than treating them as an opaque format imported from somewhere else.

My practical rule: the format is not supported until I can explain the bytes, write the scalar path, write the optimized path, and compare both against a known-good model path. “It loads” is not enough. “It matches the scalar oracle and the model parity gate” is where support begins.

Section 12: How CKE Makes Quantization Inspectable

This is where the C-Kernel-Engine IR Visualizer becomes more than a dashboard. A model filename might say Q4_K_M, but the runtime still has to inspect the actual tensor dtypes inside the artifact. A real GGUF can mix q4_k, q5_0, q6_k, q8_0, fp16, and fp32 tensors across attention, MLP, embeddings, output heads, and normalization weights.

The practical rule is simple: CKE should not trust the model-level label as the kernel contract. It should read the weights, lower the graph, build the memory plan, and make the dtype choice visible before codegen and runtime dispatch. That is why the generated ir_report.html includes a Weight Dtype Audit surface.

Figure: C-Kernel-Engine IR visualizer dtype audit excerpt showing mixed q5_0, q8_0, q6_k, q4_k, and fp32 tensors across transformer layers.

In the Qwen2 audit I am using while hardening CKE, the same model artifact shows mixed dtype behavior across layers. Some rows use q5_0 attention projections, some keep selected tensors at q8_0, MLP down projections can appear as q6_k or q4_k, and layer norms stay in fp32. That is the entire point: quantization is not a single string. It is a per-tensor execution contract.

Visualizer surface	What it proves	Why it matters for K-quants
Weight Dtype Audit	Shows the dtype for each major tensor by layer.	Prevents treating a mixed model as if every weight were the same format.
Full Chain	Connects loaded tensor metadata through IR lowering and codegen.	Lets the runtime verify that `q4_k`, `q6_k`, or `q8_0` tensors reach the right kernel family.
Per-Layer Flow Graph	Shows how each layer consumes weights and activations.	Makes it easier to debug whether a parity failure belongs to attention, MLP, normalization, or output projection.
Run Hub	Indexes generated runs and links to each `ir_report.html`.	Turns many model experiments into a browsable audit ledger instead of scattered files.
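
Because the contract is per tensor, anything that sizes memory has to work from each tensor's own dtype. The sketch below shows that idea with a small lookup table. The q4_k and q6_k rows follow the structs in Section 2; the q5_0 and q8_0 rows assume the usual GGUF 32-weight layouts and should be checked against the loader. The type and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;      /* dtype label as it appears in the audit */
    size_t block_weights;  /* weights per block                      */
    size_t block_bytes;    /* bytes per block                        */
} dtype_info;

/* q5_0 / q8_0 sizes are assumptions based on the usual GGUF layouts. */
static const dtype_info k_dtypes[] = {
    { "q4_k", 256, 144 },
    { "q6_k", 256, 210 },
    { "q5_0",  32,  22 },
    { "q8_0",  32,  34 },
    { "fp16",   1,   2 },
    { "fp32",   1,   4 },
};

/* Bytes needed for n_weights of the given dtype, or 0 if the dtype is
 * unknown or n_weights is not a whole number of blocks. */
static size_t tensor_bytes(const char *dtype, size_t n_weights) {
    for (size_t i = 0; i < sizeof(k_dtypes) / sizeof(k_dtypes[0]); ++i) {
        if (strcmp(k_dtypes[i].name, dtype) == 0) {
            if (n_weights % k_dtypes[i].block_weights != 0) {
                return 0;
            }
            return n_weights / k_dtypes[i].block_weights * k_dtypes[i].block_bytes;
        }
    }
    return 0;
}
```

A planner built this way cannot size a q6_k tensor as if it were q4_k just because the filename says Q4_K_M.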

How to use the IR Visualizer as a CKE product surface

The operator workflow is intentionally simple: start from a model, ask CKE to convert and compile it, generate the visualizer, then open the generated run report or the run hub. The point is not only to run a model. The point is to make the model inspectable as an engineering artifact.

1. Use the v7 or v8 runbook command with --generate-visualizer.
2. Open the generated ir_report.html inside the run directory.
3. Use the Weight Dtype Audit tab to inspect which tensors are q4_k, q6_k, q8_0, fp32, or another supported dtype.
4. Use the Full Chain and Per-Layer Flow Graph views to connect model weights to IR lowering, memory planning, and kernel dispatch.
5. Use the IR Hub when you have multiple model experiments and want one browsable ledger of generated runs.

For the full operator paths, see the v7 runbook and v8 inference runbook.

v7 front door: GGUF to IR Visualizer to chat

version/v7/scripts/cks-v7-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --context-len 1024 \
  --force-compile \
  --force-convert \
  --generate-visualizer

# Then open the generated run report:
# $CK_CACHE_DIR/Qwen--Qwen3-0.6B-GGUF/ir_report.html
# or $HOME/.cache/ck-engine-v7/models/Qwen--Qwen3-0.6B-GGUF/ir_report.html

v7 lower-level path: refresh an existing run report

RUN=$HOME/.cache/ck-engine-v7/models/train/<run-name>

python3 version/v7/tools/open_ir_visualizer.py \
  --generate \
  --run "$RUN" \
  --html-only \
  --strict-run-artifacts \
  --output "$RUN/ir_report.html"

python3 version/v7/tools/open_ir_hub.py --open

v8 front door: run, compile, and generate the visualizer

version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --context-len 1024 \
  --force-convert \
  --force-compile \
  --generate-visualizer

python3 version/v8/tools/open_ir_hub_v8.py --open

That inspection loop is important for CPU-first AI because the hard bugs are rarely visible at the marketing label level. A Q4_K_M model may still preserve sensitive tensors at higher precision because model quality can collapse quickly if the wrong operation is quantized too aggressively. CKE therefore needs both: the low-level kernel knowledge from this post and the IR-level audit surface that proves which dtype each tensor actually uses.

This is also the marketing story for CKE as a product: it is not only a runtime that tries to emit tokens. It is a toolchain that converts a model into an inspectable CPU execution artifact: weights, dtypes, IR graph, memory layout, generated code, kernel dispatch, parity data, and run reports.

Related C-Kernel-Engine docs

For the living CKE documentation behind this post, see the quantization overview, format-specific notes, and the operator pages that generate the visualizer reports.

Section 13: Summary

Q4_K teaches nested scale/min metadata and correction terms. Q5_K teaches the intermediate high-bit bookkeeping tier. Q6_K teaches low/high bit-plane reconstruction and signed recentering. Q8_K teaches why activation quantization is part of the mixed-dot hot path.

The lesson is not that one format is always best. The lesson is that each format is a contract. The byte layout, dequant math, activation path, SIMD implementation, IR dtype audit, and model-level validation must all agree. That is the work underneath CPU-first inference.

One-line mental model

K-quant support =
  byte layout
  + scale hierarchy
  + packed integer reconstruction
  + mixed Q8_K activation path
  + scalar oracle
  + SIMD parity
  + model-level validation
