KV Cache Memory: The Hidden State That Makes LLM Decode Work

KV cache memory for CPU-native inference

This ShivasNotes deep dive is written for engineers who want to understand the single largest memory consumer in autoregressive LLM inference: the KV cache. Without it, every decode step re-projects keys and values for the entire prefix, so projection work over a T-token generation grows as O(T²). With it, each step projects only the new token and then scans the cached history, which is O(T) work per token, but you trade compute for memory. C-Kernel-Engine (CKE) handles KV cache with deterministic memory planning, static pointer arithmetic, GQA-aware sizing, FP16 compression, and sliding-window eviction, all decided at IR build time, not at runtime. This post walks through every detail with real artifacts from Qwen3-0.6B. Video walkthrough on youtube.com/@antshivrobotics.

KV cache is the cleanest example of an inference trade every systems team eventually rediscovers: stop recomputing history, start storing history. The algorithmic win is obvious. The systems consequence is harsher. Every generated token now drags a persistent memory footprint behind it, and decode performance becomes a conversation about memory hierarchy, not just FLOPs. The flash-attention post explained why attention avoids materializing the score matrix. The CPU performance post showed decode running at arithmetic intensity ≈ 0.031 FLOPs/byte. KV cache is the bridge between those two facts.

What this post covers

The opening part explains the basic economics of cached attention: why autoregressive decode needs a history buffer, how the tensor is sized, why GQA matters, and why CKE chooses a layer-major layout with head-major access inside each layer.

The middle part moves from concept to implementation: the full 191-line kv_cache_kernels.c file, the exact generated C emitted for layer 0 in Qwen3-0.6B, FP16 cache paths, ephemeral vision attention, and the separate prefill versus decode insertion logic in build_ir_v7.py.

The closing part zooms back out to sliding-window eviction, deterministic memory planning, long-context scaling, what silicon vendors should notice in the access pattern, and how the surrounding CKE posts fit around this memory story.

Introduction — The Memory That Makes Autoregressive Inference Possible

Autoregressive generation emits one token at a time. Token t depends on tokens 1..t, so the model cannot solve the entire decode trajectory in one dense matrix multiply the way prefill can. It must repeatedly ask the same question: given one fresh query vector, what should this token attend to in all prior positions?

Without a cache, the dumb strategy is to recompute every historical key and every historical value on every decode step. The query for token t is cheap, but regenerating K_1..K_t and V_1..V_t again and again makes total work grow quadratically with sequence length. The model keeps paying rent on the same history.

With a KV cache, each token pays once. Compute K_t and V_t, write them to the cache, and from then on only read them. Attention still scans the full history, so decode is not free, but the expensive projection work for old tokens disappears.

This is why decode feels different from prefill on CPU. Prefill is dominated by GEMM-style math. Decode is dominated by memory traffic: read weights for the new projections, then stream cached K and V back through attention. In the real Qwen3-0.6B artifact, the cache alone is 224 MB, or about 24.3% of the full 922 MB activation arena: Qwen3-0.6B allocates 234,881,024 bytes for KV cache inside a 966,807,552-byte activation layout. That one buffer is almost one quarter of all live activation memory.

Autoregressive decode timeline showing each new token appending K and V to a persistent cache while attention reads the full history.

Naive decode without KV cache — every step rebuilds historical K and V

python

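# x[t] is the hidden state of token t; W_Q, W_K, W_V are the projection matrices.
for t in range(T):
    q_t = x[t] @ W_Q
    keys = []
    values = []
    for i in range(t + 1):
        # Every step re-projects the whole prefix: t + 1 K and V projections.
        keys.append(x[i] @ W_K)
        values.append(x[i] @ W_V)
    out_t = attention(q_t, keys, values)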

Cached decode — compute new K/V once, append, then reuse

python

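# x[t] is the hidden state of token t; the caches are preallocated with T slots.
for t in range(T):
    q_t = x[t] @ W_Q
    k_t = x[t] @ W_K  # projected exactly once per token
    v_t = x[t] @ W_V
    kv_cache_k[t] = k_t
    kv_cache_v[t] = v_t
    out_t = attention(q_t, kv_cache_k[:t+1], kv_cache_v[:t+1])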
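
A quick way to convince yourself that caching changes the cost and not the answer is to run both loops on tiny random data. The sketch below is illustrative and is not CKE code: plain Python lists stand in for tensors, and `matvec`, `attention`, `decode_naive`, and `decode_cached` are hypothetical helper names chosen for this example.

```python
import math
import random

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attention(q, keys, values):
    """Single-query softmax attention over lists of key and value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def decode_naive(xs, W_Q, W_K, W_V):
    outs = []
    for t in range(len(xs)):
        q = matvec(xs[t], W_Q)
        keys = [matvec(xs[i], W_K) for i in range(t + 1)]    # rebuilt every step
        values = [matvec(xs[i], W_V) for i in range(t + 1)]  # rebuilt every step
        outs.append(attention(q, keys, values))
    return outs

def decode_cached(xs, W_Q, W_K, W_V):
    cache_k, cache_v, outs = [], [], []
    for t in range(len(xs)):
        q = matvec(xs[t], W_Q)
        cache_k.append(matvec(xs[t], W_K))  # projected once, then only read
        cache_v.append(matvec(xs[t], W_V))
        outs.append(attention(q, cache_k, cache_v))
    return outs

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, T = 4, 6
W_Q, W_K, W_V = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
xs = rand_matrix(T, d)

naive = decode_naive(xs, W_Q, W_K, W_V)
cached = decode_cached(xs, W_Q, W_K, W_V)
max_diff = max(abs(a - b) for ra, rb in zip(naive, cached) for a, b in zip(ra, rb))
print("max difference between naive and cached outputs:", max_diff)
```

Both paths feed identical K and V vectors into the same attention function, so the outputs agree; the only thing that changes is how many times each projection is computed.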

Real Qwen3-0.6B memory summary from layout_decode.map

text

MEMORY SUMMARY
--------------------------------------------------------------------------------
  Total:                1,606,394,890 bytes  (1.50 GB)
  Weights:                639,587,338 bytes  (610.0 MB)
  Activations:            966,807,552 bytes  (922.0 MB)

The Math — Why KV Cache Works

Standard attention starts with three projections. The hidden state matrix X is multiplied by W_Q, W_K, and W_V to produce query, key, and value tensors. The output is softmax(Q·Kᵀ / √d) · V.

During prefill, that is fine. The whole prompt already exists, so computing all Q, K, and V in a single dense pass is exactly what GEMM is good at. The runtime can fill the cache once for the prompt and move on.

Decode is the asymmetry. At step t, you only have one fresh token embedding x_t. So Q_t is one vector, but attention still needs the entire set of keys and values accumulated so far. That leaves only two choices: recompute old K/V or store old K/V.

The cache stores a very boring tensor: K[layer][head][position][head_dim] and V[layer][head][position][head_dim]. The whole trick is that boring tensors win systems wars. Once K and V become stable addresses, the runtime can schedule reads deterministically. KV cache is not an approximation. It is exact attention with the historical K and V tensors memoized instead of regenerated.

Attention equations and the generic KV-cache cost formula

text

Q = X · W_Q
K = X · W_K
V = X · W_V
Attn(X) = softmax(Q · K^T / sqrt(d)) · V
KV cache cost = 2 × num_layers × num_kv_heads × seq_len × head_dim × bytes_per_element

Decode thought experiment — recomputing historical projections is the waste

text

At token t:
  Q_t = x_t · W_Q
  Option A: recompute K_i = x_i · W_K and V_i = x_i · W_V for all i in [1, t]
  Option B: read K_1..K_t and V_1..V_t from cache
Option B is the KV cache.
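
The gap between Option A and Option B can be counted directly. This is a back-of-the-envelope sketch that counts K and V projections only (it ignores the attention scan itself), and the function names are made up for the example.

```python
def projection_count_recompute(T):
    """Option A: step t (1-based) re-projects K and V for all t tokens."""
    return sum(2 * t for t in range(1, T + 1))

def projection_count_cached(T):
    """Option B: each step projects K and V for the new token only."""
    return 2 * T

T = 1024
recompute = projection_count_recompute(T)
cached = projection_count_cached(T)
print(recompute, cached, recompute / cached)  # prints: 1049600 2048 512.5
```

At a 1,024-token context the recompute strategy performs about 512 times as many K/V projections per layer as the cached one, and that ratio grows linearly with T.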

What the cache actually stores on each decode step

text

write:
  K_new -> kv_cache[layer][0][kv_head][pos][:]
  V_new -> kv_cache[layer][1][kv_head][pos][:]
read:
  attention(q_t,
            kv_cache[layer][0][:][:pos+1][:],
            kv_cache[layer][1][:][:pos+1][:])

The Memory Math — How Big Is It?

The sizing formula is simple enough to memorize and important enough that you should. There are two tensors per layer — one for keys and one for values. Each tensor holds num_kv_heads × seq_len × head_dim elements. Multiply by the number of layers and the bytes per element and the bill appears immediately.

For Qwen3-0.6B in FP32, the exact calculation is: 28 × 2 × 8 × 1024 × 128 × 4 = 234,881,024 bytes. In binary units that is exactly 224 MiB. This is not a hypothetical. It is the buffer in the real decode artifact.

Larger models turn this into the long-context serving problem. The exact per-request GQA numbers are already large. Multi-user serving, replica duplication, and orchestration overhead then multiply them again. That is why operators talk about KV cache first and model weights second when context windows get long.

This is also why serving conversations often quote scarier headline figures than the raw tensor itself. Once replication and concurrency are folded in, people start talking in round numbers like 16 GB for an 8B-class deployment, 40 GB for a 70B-class deployment, and roughly 504 GB at the 405B / 128K extreme. The table below keeps the single-request tensor honest; the fleet number is whatever multiplier your serving system imposes on top.

Model	Layers	KV heads	Context	Head dim	FP32 size	FP16 size
Qwen3-0.6B	28	8	1,024	128	224 MB	112 MB
Llama-3-8B	32	8	8,192	128	2.0 GB	1.0 GB
Llama-3-70B	80	8	8,192	128	5.0 GB	2.5 GB
Llama-3-405B	126	8	131,072	128	126 GB	63 GB
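
The table rows can be reproduced from the sizing formula. The sketch below assumes the layer, KV-head, context, and head-dim values exactly as listed in the table; `kv_cache_bytes` is a hypothetical helper, not a CKE function.

```python
def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim, bytes_per_element):
    """Two tensors (K and V) per layer, each num_kv_heads x seq_len x head_dim."""
    return 2 * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_element

GIB = 1024 ** 3

# (layers, kv_heads, context, head_dim) copied from the table rows.
models = {
    "Qwen3-0.6B": (28, 8, 1024, 128),
    "Llama-3-8B": (32, 8, 8192, 128),
    "Llama-3-70B": (80, 8, 8192, 128),
    "Llama-3-405B": (126, 8, 131072, 128),
}

sizes = {}
for name, (layers, kv_heads, context, head_dim) in models.items():
    fp32 = kv_cache_bytes(layers, kv_heads, context, head_dim, 4)
    fp16 = kv_cache_bytes(layers, kv_heads, context, head_dim, 2)
    sizes[name] = (fp32, fp16)
    print(f"{name:13s} FP32 {fp32 / GIB:8.3f} GiB   FP16 {fp16 / GIB:8.3f} GiB")
```

The results land on whole binary units (224 MiB, 2 GiB, 5 GiB, 126 GiB) because every dimension in these shapes carries large power-of-two factors.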

Headline numbers like 16 GB, 40 GB, or 504 GB are serving footprints after multiplying by users, replicas, or wider deployment assumptions, not the raw single-request tensor. The table's MB and GB columns are binary units (MiB and GiB), which is why the Qwen3-0.6B row matches the 224 MiB figure exactly. The formula itself stays linear and honest. CKE keeps the honest number visible because the IR builder sizes the exact buffer before codegen. There is no hidden allocator later quietly changing the answer.

KV cache scaling chart showing linear growth with layers, context length, head dimension, and bytes per element.

Real build_ir_v7.py sizing logic for the persistent KV buffer

python

    # KV cache + RoPE
    kv_per_layer = num_kv_heads * context_len * head_dim * 4
    total_kv_size = num_layers * 2 * kv_per_layer
    add("kv_cache", total_kv_size, f"[{num_layers}, 2, {num_kv_heads}, {context_len}, {head_dim}]")

Concrete Qwen3-0.6B calculation in Python

python

num_layers = 28
num_kv_heads = 8
context_len = 1024
head_dim = 128
bytes_per_element = 4
kv_per_layer = num_kv_heads * context_len * head_dim * bytes_per_element
total_kv_size = num_layers * 2 * kv_per_layer
print(total_kv_size)  # 234881024 bytes
print(total_kv_size / 1024 / 1024)  # 224.0 MiB

Formula view — why every parameter matters

python

# kv_bytes = 2 * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_element
# Qwen3-0.6B decode cache (named kv_bytes to avoid shadowing Python's built-in bytes)
kv_bytes = 2 * 28 * 8 * 1024 * 128 * 4
#        = 234,881,024 bytes
#        = 224 MiB

Serving-footprint caveat — exact tensor first, fleet math second

python

# Raw per-request GQA footprint is the tensor itself.
# Serving footprint is larger when you multiply by concurrent users,
# replicas, prefill staging, or multi-device duplication.
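
To make the caveat concrete, here is a hypothetical fleet estimate. The multiplier of eight concurrent requests is an invented number chosen purely for illustration; it is not a claim about how any quoted headline figure was derived, and `serving_footprint_bytes` is not part of CKE.

```python
GIB = 1024 ** 3

def serving_footprint_bytes(per_request_bytes, concurrent_requests=1, replicas=1):
    """Hypothetical fleet estimate: the raw tensor times deployment multipliers."""
    return per_request_bytes * concurrent_requests * replicas

# Llama-3-8B row from the table, FP32, one full-context request.
per_request = 2 * 32 * 8 * 8192 * 128 * 4
fleet = serving_footprint_bytes(per_request, concurrent_requests=8, replicas=1)
print(per_request / GIB, fleet / GIB)  # prints: 2.0 16.0
```

The per-request term is the only part that comes from the model architecture; everything else is a property of the serving system.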

GQA — Grouped Query Attention as KV Cache Optimization

Multi-head attention does not require the number of KV heads to equal the number of query heads. That freedom is what makes Grouped Query Attention so important. If multiple query heads can share one KV head, the cache shrinks immediately with only modest quality cost.

In plain MHA, num_kv_heads = num_heads. In GQA, num_kv_heads < num_heads. In MQA, there is just one KV head shared by every query head. The ladder is really a memory ladder.

Qwen3-0.6B uses 16 query heads and 8 KV heads. That means each KV head serves two query heads. Relative to a 16-KV-head MHA design, the cache is cut in half before any quantization or eviction trick is applied.

Type	KV-head rule	Memory ratio	Quality / behavior
MHA	num_kv_heads = num_heads	1.0×	highest
GQA	num_kv_heads < num_heads	num_kv_heads / num_heads	near-MHA, much cheaper
MQA	num_kv_heads = 1	1 / num_heads	lowest memory, more quality risk

GQA is the architectural optimization that makes all downstream KV-cache work less desperate. Once the model family chooses fewer KV heads, the memory planner, pointer arithmetic, and decode kernels all get the smaller tensor automatically. Qwen3-0.6B stores 8 KV heads for 16 query heads, so its cache is 2× smaller: an MHA version with 16 KV heads would double the cache from 224 MiB to 448 MiB.
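
The memory ladder is easy to tabulate for the Qwen3-0.6B shape. In this sketch the MHA and MQA rows are hypothetical variants of the same model (16 and 1 KV heads) used only for comparison; only the 8-KV-head GQA row corresponds to the real config.

```python
def kv_bytes(num_kv_heads, num_layers=28, seq_len=1024, head_dim=128, bytes_per_element=4):
    return 2 * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_element

MIB = 1024 ** 2
num_heads = 16
variants = {"MHA": num_heads, "GQA": 8, "MQA": 1}  # KV heads per variant

mha_bytes = kv_bytes(variants["MHA"])
report = {}
for name, kv_heads in variants.items():
    size = kv_bytes(kv_heads)
    report[name] = (size // MIB, size / mha_bytes)
    print(f"{name}: {size // MIB} MiB, {size / mha_bytes:.4f}x of MHA")
```

The ratio column is just num_kv_heads / num_heads, which is why the choice of KV-head count propagates to every later sizing decision.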

Comparison of MHA, GQA, and MQA showing how fewer KV heads reduce cache memory while query heads still fan out.

Real Qwen3-0.6B config excerpt from layout_decode.map

text

CONFIGURATION
--------------------------------------------------------------------------------
  model_type               qwen3
  embed_dim                1024
  num_heads                16
  num_kv_heads             8
  head_dim                 128
  intermediate_dim         3072
  num_layers               28
  vocab_size               151936
  max_seq_len              1024
  context_len              1024
  rope_theta               1000000.0
  rms_norm_eps             9.999999974752427e-07

Head broadcasting rule used by GQA attention kernels

    const size_t kv_head_stride = (size_t)kv_stride_tokens * (size_t)aligned_head_dim;
    for (int h = 0; h < num_heads; ++h) {
        int kv_head = (int)((long long)h * (long long)num_kv_heads / (long long)num_heads);
        const float *k_head = k + (size_t)kv_head * kv_head_stride;
        const float *v_head = v + (size_t)kv_head * kv_head_stride;

Same idea in simplified pseudocode

python

num_heads = 16
num_kv_heads = 8
group_size = num_heads // num_kv_heads
for h in range(num_heads):
    kv_head = h // group_size
    read_k = kv_cache_k[kv_head]
    read_v = kv_cache_v[kv_head]
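
The C kernel computes the KV head as `h * num_kv_heads / num_heads` with integer division, while the pseudocode uses `h // group_size`. The sketch below checks that the two agree for the Qwen3-0.6B head counts; `kv_head_for_query_head` is a name invented for this example.

```python
def kv_head_for_query_head(h, num_heads, num_kv_heads):
    """Same integer arithmetic as the C excerpt: h * num_kv_heads / num_heads."""
    return (h * num_kv_heads) // num_heads

num_heads, num_kv_heads = 16, 8
group_size = num_heads // num_kv_heads

mapping = [kv_head_for_query_head(h, num_heads, num_kv_heads) for h in range(num_heads)]
print(mapping)  # prints: [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
```

When num_heads is an exact multiple of num_kv_heads the two forms are interchangeable; the multiply-then-divide form simply avoids computing group_size first.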

Layout — Layer-Major with Head-Major Access

Layout is where a memory theory becomes a performance theory. CKE names the contract layer_major_kv_cache because each layer only ever wants its own K and V slice. Keeping a whole layer contiguous makes the generated pointer arithmetic trivial and keeps unrelated layers from interleaving.

Inside each layer, the practical access pattern is head-major. Decode attention walks one head at a time across all positions, so the cache wants [kv_head, seq, dim] contiguity. That produces long sequential reads instead of a head-interleaved stride mess.

The alternative [seq, kv_head, dim] layout is not impossible, but it makes a per-head scan uglier because every token step interleaves different heads. CKE avoids that by fixing the contract in the template and lowering directly to that shape.

Layout type	Template / shape	Use case	Why it exists
`layer_major_kv_cache`	`[layers, 2, kv_heads, seq, dim]`	autoregressive decoders	Layer-local reads are contiguous; K and V are deterministic offsets.
`head-major within layer`	`[kv_heads, seq, dim]`	decode attention kernels	A head scans its own history sequentially.
`position-major`	`[seq, kv_heads, dim]`	rare in CKE decode	Worse for head-at-a-time scans because heads are interleaved.
`ephemeral_full_context`	no persistent cache	vision encoders	Compute K/V, use once, discard.

The key phrase is deterministic addressability. Once the layout is fixed at IR build time, codegen never has to negotiate where layer 17 head 3 position 912 lives. It is a closed-form offset. Layer-major outside, head-major inside: that is the exact blend you want for a decoder where attention is layer-local but scans head histories sequentially.

Layer-major KV cache layout with contiguous K and V slices for each layer and head-major history inside the layer block.

Qwen3 template declares the KV-cache layout explicitly

json

{
  "rope_layout": "split",
  "rope_type": "rope",
  "qk_norm": true,
  "kv_layout": "layer_major_kv_cache",
  "attn_variant": "dense",
  "train_runtime_contract": {
    "saved_tensor_kernel_overrides": {
      "attn_weights": "attention_forward_causal_head_major_gqa_exact"
    }
  }
}

Real memory-map line for the cache tensor

text

0x000000C05000 0x00000EC05000  234,881,024  ( 224.00 MB)  kv_cache                 [28, 2, 8, 1024, 128]

Index math for layer-major KV cache with head-major history

python

# layer-major, then K/V selector, then head-major history
index = (((layer * 2 + k_or_v) * num_kv_heads + kv_head) * max_seq_len + pos) * head_dim + d
# k_or_v = 0 for K, 1 for V
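
As a sanity check on the closed-form offset, the sketch below evaluates the index formula for the Qwen3-0.6B shape, including the "layer 17, head 3, position 912" example from the text. `kv_index` is a hypothetical helper that restates the formula; it assumes no head-dim padding, which holds here because head_dim is 128.

```python
NUM_LAYERS, NUM_KV_HEADS, MAX_SEQ_LEN, HEAD_DIM = 28, 8, 1024, 128

def kv_index(layer, k_or_v, kv_head, pos, d=0):
    """Flat element index into [layers, 2, kv_heads, seq, dim]; k_or_v is 0 for K, 1 for V."""
    return (((layer * 2 + k_or_v) * NUM_KV_HEADS + kv_head) * MAX_SEQ_LEN + pos) * HEAD_DIM + d

total_elements = NUM_LAYERS * 2 * NUM_KV_HEADS * MAX_SEQ_LEN * HEAD_DIM
first = kv_index(0, 0, 0, 0, 0)
last = kv_index(NUM_LAYERS - 1, 1, NUM_KV_HEADS - 1, MAX_SEQ_LEN - 1, HEAD_DIM - 1)
example = kv_index(17, 0, 3, 912)            # K row for layer 17, head 3, position 912
step = kv_index(17, 0, 3, 913) - example     # distance to the next position in that head

print(first, last, total_elements)
print(example, example * 4, step)            # element index, FP32 byte offset, row stride
```

Consecutive positions of one head sit exactly head_dim elements apart, which is the sequential scan the decode kernel relies on.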

The KV Cache Kernels — 191 Lines of C

One of the best things about this part of CKE is how small it is. kv_cache_kernels.c is 191 lines long. There is no framework fog here, just explicit helpers for writing, repacking, and optionally compressing the cache.

The file opens with a tiny FP32→FP16 row helper, then defines four KV-oriented routines. kv_cache_repack_head_major_inplace() handles in-place movement when capacity changes. kv_cache_write_head_major() does the real copy loop. kv_cache_store() is the FP32 wrapper, and kv_cache_store_f16() does the compressed write path.

The implementation style matters. The writer is explicit instead of hiding behind opaque memcpy everywhere because alignment and zero-padding need to stay obvious. That makes the code teachable and auditable.

Function	Purpose	Dtype path	Line count
`kv_cache_repack_head_major_inplace`	Move head blocks when effective capacity changes	FP32	32
`kv_cache_write_head_major`	Core explicit writer for one token across all KV heads	FP32	41
`kv_cache_store`	Thin decode wrapper that writes FP32 K/V into cache	FP32	20
`kv_cache_store_f16`	Convert FP32 scratch to FP16 on write	FP16 cache	47
`logits_copy_to_position`	Positioned logits copy helper in the same source file	FP32	14

Showing the full source is useful because it proves how little magic there is. The complexity of KV cache in CKE is mostly in planning, layout, and code generation. The write kernels themselves are intentionally plain C. At 191 lines, the entire file is short enough to audit in one sitting. That is exactly what you want for a persistent-memory primitive that every decode token touches.

kv_cache_kernels.c — file header, helper, includes, and kernel contract

/**
 * @file kv_cache_kernels.c
 * @brief KV-cache helper kernels (head-major layout)
 *
 * CK-ENGINE KERNEL RULES:
 * =======================
 * 1. NO malloc/free - memory via bump allocator, pointers passed in
 * 2. NO OpenMP - parallelization at orchestrator/codegen layer
 * 3. API must define: inputs, outputs, workspace, and memory layouts
 * 4. Pure computation - deterministic, no side effects
 *
 * After changes: make test && make llamacpp-parity-full
 *
 * Small, explicit helpers used by the runtime/orchestrator to maintain
 * per-layer KV caches during autoregressive decoding.
 *
 * Layout:
 *   k_cache[kv_head, token, aligned_head_dim]
 *   v_cache[kv_head, token, aligned_head_dim]
 * with contiguous row-major storage and stride aligned_head_dim.
 */
#include "ckernel_engine.h"
#include <stddef.h>
#include <string.h>
static inline void ck_local_fp32_to_fp16_row(const float *src, uint16_t *dst, int n)
{
    if (!src || !dst || n <= 0) {
        return;
    }
    for (int i = 0; i < n; ++i) {
        dst[i] = CK_FP32_TO_FP16(src[i]);
    }
}

kv_cache_kernels.c — kv_cache_repack_head_major_inplace()

void kv_cache_repack_head_major_inplace(float *buf,
                                        int num_heads,
                                        int tokens,
                                        int cache_capacity,
                                        int aligned_head_dim)
{
    if (!buf) {
        return;
    }
    if (num_heads <= 0 || tokens <= 0 || cache_capacity <= 0 || aligned_head_dim <= 0) {
        return;
    }
    if (tokens > cache_capacity) {
        tokens = cache_capacity;
    }
    if (tokens == cache_capacity) {
        return;
    }
    const size_t old_head_stride = (size_t)tokens * (size_t)aligned_head_dim;
    const size_t new_head_stride = (size_t)cache_capacity * (size_t)aligned_head_dim;
    const size_t bytes = (size_t)tokens * (size_t)aligned_head_dim * sizeof(float);
    // Move head blocks from high to low to avoid overwriting source data
    // for heads that have not yet been moved.
    for (int h = num_heads - 1; h >= 0; --h) {
        float *src = buf + (size_t)h * old_head_stride;
        float *dst = buf + (size_t)h * new_head_stride;
        memmove(dst, src, bytes);
    }
}
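
The high-to-low loop order is easier to see on a toy buffer. The following is a Python model of the same repack logic, written for illustration only: a flat list stands in for the float buffer, and slice assignment plays the role of `memmove` because the right-hand side is copied before it is written.

```python
def repack_head_major_inplace(buf, num_heads, tokens, cache_capacity, aligned_head_dim):
    """Spread tightly packed head blocks out to a stride of cache_capacity tokens."""
    if num_heads <= 0 or tokens <= 0 or cache_capacity <= 0 or aligned_head_dim <= 0:
        return
    tokens = min(tokens, cache_capacity)
    if tokens == cache_capacity:
        return  # already laid out at full stride
    old_stride = tokens * aligned_head_dim
    new_stride = cache_capacity * aligned_head_dim
    count = tokens * aligned_head_dim
    # Highest head first: its destination lies beyond every not-yet-moved source block.
    for h in range(num_heads - 1, -1, -1):
        src = h * old_stride
        dst = h * new_stride
        buf[dst:dst + count] = buf[src:src + count]

num_heads, tokens, capacity, dim = 3, 2, 4, 2
buf = [0.0] * (num_heads * capacity * dim)
packed = [float(100 * h + i) for h in range(num_heads) for i in range(tokens * dim)]
buf[:len(packed)] = packed  # heads packed back to back with stride tokens * dim

repack_head_major_inplace(buf, num_heads, tokens, capacity, dim)
heads = [buf[h * capacity * dim: h * capacity * dim + tokens * dim] for h in range(num_heads)]
print(heads)
```

Walking the heads in ascending order instead would let head 1 land on top of head 2's still-unmoved source data. Note that, as in the C version, stale values are left behind in the gaps; only the first `tokens` rows of each head are meaningful afterward.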

kv_cache_kernels.c — kv_cache_write_head_major()

void kv_cache_write_head_major(const float *__restrict k_token,
                               const float *__restrict v_token,
                               float *__restrict k_cache,
                               float *__restrict v_cache,
                               int num_kv_heads,
                               int token_index,
                               int cache_capacity,
                               int head_dim,
                               int aligned_head_dim)
{
    if (!k_token || !v_token || !k_cache || !v_cache) {
        return;
    }
    if (num_kv_heads <= 0 || token_index < 0 || cache_capacity <= 0) {
        return;
    }
    if (token_index >= cache_capacity || head_dim <= 0 || aligned_head_dim <= 0) {
        return;
    }
    const size_t head_stride = (size_t)cache_capacity * (size_t)aligned_head_dim;
    const size_t token_stride = (size_t)aligned_head_dim;
    for (int h = 0; h < num_kv_heads; ++h) {
        const float *k_src = k_token + (size_t)h * token_stride;
        const float *v_src = v_token + (size_t)h * token_stride;
        float *k_dst = k_cache + (size_t)h * head_stride + (size_t)token_index * token_stride;
        float *v_dst = v_cache + (size_t)h * head_stride + (size_t)token_index * token_stride;
        for (int d = 0; d < head_dim; ++d) {
            k_dst[d] = k_src[d];
            v_dst[d] = v_src[d];
        }
        for (int d = head_dim; d < aligned_head_dim; ++d) {
            k_dst[d] = 0.0f;
            v_dst[d] = 0.0f;
        }
    }
}
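
The zero-padding behavior is the part of the writer that is easiest to overlook. Below is a small Python model of the same loop, assuming (as the C code does) that the source token rows also use a stride of aligned_head_dim; the toy sizes and sentinel values are invented for the example.

```python
def write_head_major(k_token, v_token, k_cache, v_cache,
                     num_kv_heads, token_index, cache_capacity, head_dim, aligned_head_dim):
    """Copy one token's K/V rows into [kv_head, token, aligned_head_dim] caches."""
    if not (0 <= token_index < cache_capacity):
        return
    head_stride = cache_capacity * aligned_head_dim
    token_stride = aligned_head_dim
    for h in range(num_kv_heads):
        src = h * token_stride
        dst = h * head_stride + token_index * token_stride
        for d in range(head_dim):
            k_cache[dst + d] = k_token[src + d]
            v_cache[dst + d] = v_token[src + d]
        for d in range(head_dim, aligned_head_dim):
            k_cache[dst + d] = 0.0  # padding lanes are always cleared
            v_cache[dst + d] = 0.0

num_kv_heads, capacity, head_dim, aligned = 2, 3, 3, 4
k_cache = [-1.0] * (num_kv_heads * capacity * aligned)  # -1.0 marks untouched slots
v_cache = [-1.0] * (num_kv_heads * capacity * aligned)
k_token = [1.0, 2.0, 3.0, 9.0, 4.0, 5.0, 6.0, 9.0]      # 9.0 is junk in the source padding
v_token = [x * 10 for x in k_token]

write_head_major(k_token, v_token, k_cache, v_cache, num_kv_heads, 1, capacity, head_dim, aligned)
print(k_cache[4:8], k_cache[16:20])
```

The junk value in the source padding lane never reaches the cache, so a downstream kernel that reads all aligned_head_dim lanes sees zeros there.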

kv_cache_kernels.c — kv_cache_store()

void kv_cache_store(float *__restrict kv_cache_k,
                    float *__restrict kv_cache_v,
                    const float *__restrict k,
                    const float *__restrict v,
                    int layer,
                    int pos,
                    int num_kv_heads,
                    int head_dim,
                    int max_seq_len)
{
    (void)layer;
    kv_cache_write_head_major(k, v,
                              kv_cache_k, kv_cache_v,
                              num_kv_heads,
                              pos,
                              max_seq_len,
                              head_dim,
                              head_dim);
}

kv_cache_kernels.c — kv_cache_store_f16()

void kv_cache_store_f16(uint16_t *__restrict kv_cache_k,
                        uint16_t *__restrict kv_cache_v,
                        const float *__restrict k,
                        const float *__restrict v,
                        int layer,
                        int pos,
                        int num_kv_heads,
                        int head_dim,
                        int max_seq_len)
{
    (void)layer;
    if (!kv_cache_k || !kv_cache_v || !k || !v) {
        return;
    }
    if (num_kv_heads <= 0 || pos < 0 || head_dim <= 0 || max_seq_len <= 0) {
        return;
    }
    if (pos >= max_seq_len) {
        return;
    }
    const size_t head_stride = (size_t)max_seq_len * (size_t)head_dim;
    const size_t token_stride = (size_t)head_dim;
    for (int h = 0; h < num_kv_heads; ++h) {
        const float *k_src = k + (size_t)h * token_stride;
        const float *v_src = v + (size_t)h * token_stride;
        uint16_t *k_dst = kv_cache_k + (size_t)h * head_stride + (size_t)pos * token_stride;
        uint16_t *v_dst = kv_cache_v + (size_t)h * head_stride + (size_t)pos * token_stride;
        ck_local_fp32_to_fp16_row(k_src, k_dst, head_dim);
        ck_local_fp32_to_fp16_row(v_src, v_dst, head_dim);
    }
}

kv_cache_kernels.c — logits_copy_to_position() in the same 191-line source file

/**
 * @brief Copy logits to position-indexed location in output buffer.
 *
 * Used in decode mode to copy single-token logits from position 0 to
 * the correct sequence position. This moves buffer management logic
 * from codegen to the IR layer, making codegen "dumb" - just emit
 * kernel calls, no runtime if-statements.
 *
 * @param src       Source logits buffer (single token) [vocab_size]
 * @param dst       Destination logits buffer [max_seq_len, vocab_size]
 * @param position  Token position index (0-based)
 * @param vocab_size Number of logits per token
 */
void logits_copy_to_position(const float *__restrict src,
                              float *__restrict dst,
                              int position,
                              int vocab_size)
{
    if (!src || !dst || position < 0 || vocab_size <= 0) {
        return;
    }
    // Copy logits to dst[position * vocab_size : (position+1) * vocab_size]
    // Use memmove for safety in case src and dst overlap (e.g., src == dst)
    float *dst_pos = dst + (size_t)position * (size_t)vocab_size;
    memmove(dst_pos, src, (size_t)vocab_size * sizeof(float));
}

The Store-and-Read Flow in Generated Code

The generated Qwen3-0.6B decode path makes the lifecycle obvious. First project Q, K, and V into scratch buffers. Then normalize and apply RoPE. Then call kv_cache_store() for this layer and position. Immediately afterward, call the decode attention kernel using the newly extended cache.

That ordering matters. The current token's K and V are written before attention reads the cache, so the layer can attend to the full prefix including the token just projected. Generated C turns that rule into a fixed op schedule instead of runtime branching.

For layer 0 in the real artifact, Op 9 stores K and V, and Op 10 reads them back through attention_forward_decode_head_major_gqa_flash. The buffer math is completely static: only model->pos changes at runtime. This is the “dumb codegen” idea from the IR pipeline post in action. The hard decision was made earlier: which buffer, which layer, which kernel, which offset. The generated file only replays the plan.

Generated C flow for one decoder layer: project Q/K/V, apply RoPE, store K/V into cache, then read the cache during decode attention.

One-layer decode flow as a compact step list

text

1. Q = gemv_q8_0_q8_0(wq, x) -> q_scratch
2. K = gemv_q8_0_q8_0(wk, x) -> k_scratch
3. V = gemv_q8_0_q8_0(wv, x) -> v_scratch
4. qk_norm(q, k)
5. rope_qk(q, k, cos, sin, pos)
6. kv_cache_store(kv_cache_k_L0, kv_cache_v_L0, k_scratch, v_scratch, pos)
7. attention_forward_decode_head_major_gqa_flash(q, kv_cache_k_L0, kv_cache_v_L0, out, pos)
8. out_proj, residual, MLP

Real generated C from Qwen3-0.6B model_v7.c — Op 9 and Op 10

    /* Op 9: kv_cache_store (kv_cache_store) layer=0 section=body */
    kv_cache_store(
        (float*)((model->kv_cache + (0*2)*NUM_KV_HEADS*MAX_SEQ_LEN*HEAD_DIM)),
        (float*)((model->kv_cache + (0*2+1)*NUM_KV_HEADS*MAX_SEQ_LEN*HEAD_DIM)),
        (const float*)(model->bump + A_K_SCRATCH),
        (const float*)(model->bump + A_V_SCRATCH),
        0,
        model->pos,
        8,
        128,
        1024
    );
    if (stop_seq == 9) return;
    /* Op 10: attention_forward_decode_head_major_gqa_flash (attn) layer=0 section=body */
    attention_forward_decode_head_major_gqa_flash(
        (const float*)(model->bump + A_Q_SCRATCH),
        (const float*)((model->kv_cache + (0*2)*NUM_KV_HEADS*MAX_SEQ_LEN*HEAD_DIM)),
        (const float*)((model->kv_cache + (0*2+1)*NUM_KV_HEADS*MAX_SEQ_LEN*HEAD_DIM)),
        (float*)(model->bump + A_ATTN_SCRATCH),
        16,
        8,
        model->pos + 1,
        1024,
        128,
        128
    );

Real generated defines around the activation-side KV cache pointer

#define A_TOKEN_IDS 639604274
#define A_EMBEDDED_INPUT 639608370
#define A_LAYER_INPUT 643802674
#define A_RESIDUAL 647996978
#define A_KV_CACHE 652191282
#define A_ROPE_CACHE 887072306
#define A_Q_SCRATCH 887596594
#define A_K_SCRATCH 895985202
#define A_V_SCRATCH 900179506

Pointer arithmetic breakdown for one layer

NUM_KV_HEADS = 8
MAX_SEQ_LEN = 1024
HEAD_DIM = 128
BYTES = 4
floats_per_layer_side = NUM_KV_HEADS * MAX_SEQ_LEN * HEAD_DIM
bytes_per_layer_side = floats_per_layer_side * BYTES      # 4,194,304
bytes_per_layer_kv = 2 * bytes_per_layer_side             # 8,388,608
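
The same breakdown can be extended to every layer and cross-checked against the generated defines. This sketch assumes `model->kv_cache` is a `float *`, so the `(layer*2)` and `(layer*2+1)` terms in the generated C are element offsets; the excerpt does not show the field's declaration, so treat that as an assumption.

```python
NUM_LAYERS, NUM_KV_HEADS, MAX_SEQ_LEN, HEAD_DIM, BYTES = 28, 8, 1024, 128, 4
floats_per_layer_side = NUM_KV_HEADS * MAX_SEQ_LEN * HEAD_DIM

def layer_offsets(layer):
    """Element offsets of a layer's K and V slices, mirroring (layer*2) and (layer*2+1)."""
    k_off = (layer * 2) * floats_per_layer_side
    v_off = (layer * 2 + 1) * floats_per_layer_side
    return k_off, v_off

# Values copied from the generated defines shown above.
A_KV_CACHE = 652191282
A_ROPE_CACHE = 887072306

total_bytes = NUM_LAYERS * 2 * floats_per_layer_side * BYTES
for layer in (0, 1, NUM_LAYERS - 1):
    k_off, v_off = layer_offsets(layer)
    print(f"layer {layer:2d}: K at byte {k_off * BYTES:>11,}  V at byte {v_off * BYTES:>11,}")
print("gap between A_KV_CACHE and A_ROPE_CACHE:", A_ROPE_CACHE - A_KV_CACHE)
```

The distance between the two defines equals the 234,881,024-byte cache, so the RoPE cache begins exactly where the last layer's V slice ends.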

FP16 KV Cache — Halving the Memory

KV cache is a natural target for compression because K and V are intermediate activations, not immutable model weights. Storing them in FP16 is usually far safer than aggressively quantizing the weights that define the model itself. You are reducing cache bandwidth and footprint, not rewriting learned parameters.

For Qwen3-0.6B, an FP16 cache would take the raw tensor from 224 MiB down to 112 MiB. The gain scales linearly with context and layer count, so the bigger the deployment, the more valuable this becomes. CKE exposes it as an ordinary kernel/layout choice, not as a hidden runtime mode.

The mechanism is explicit. kv_cache_store_f16() converts each FP32 element to FP16 when the token is written. Decode attention then uses a kernel variant that knows it is reading the rounded cache representation.

This is a great example of template-driven specialization. The cache dtype is described in the model template and the kernel registry; the IR builder then selects the correct decode kernel before C is emitted. For the Qwen3-0.6B shape, FP16 halves 234,881,024 bytes to 117,440,512 bytes (112 MiB). Long-context deployments feel that reduction immediately.
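
To get a feel for what rounding through FP16 costs, the sketch below converts a few values to half precision and back using Python's `struct` module (format code `e`). It is only an approximation of the write path: the behavior of the `CK_FP32_TO_FP16` macro is not shown in this post, Python floats are doubles rather than FP32, and the sample values are arbitrary.

```python
import struct

def fp32_to_fp16_bits(x):
    """Round a Python float to IEEE 754 half precision and return the 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def fp16_bits_to_float(bits):
    """Expand a 16-bit half-precision pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

row = [0.1, -1.5, 3.14159, 0.0, 1024.25]
stored = [fp32_to_fp16_bits(x) for x in row]         # what a uint16_t cache row would hold
restored = [fp16_bits_to_float(b) for b in stored]   # what attention would read back
errors = [abs(a - b) for a, b in zip(row, restored)]

fp32_bytes = len(row) * 4
fp16_bytes = len(stored) * 2
print(restored)
print("max abs error:", max(errors), "bytes:", fp32_bytes, "->", fp16_bytes)
```

Small values keep roughly three decimal digits, while values near 1024 are rounded to whole numbers, so the absolute error depends on magnitude. Values above 65504 do not fit in half precision at all, and `struct.pack` raises `OverflowError` for them.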

FP32 versus FP16 KV cache memory footprint comparison, showing half the bytes and lower bandwidth pressure on decode.

The FP16 write path in kv_cache_store_f16()

void kv_cache_store_f16(uint16_t *__restrict kv_cache_k,
                        uint16_t *__restrict kv_cache_v,
                        const float *__restrict k,
                        const float *__restrict v,
                        int layer,
                        int pos,
                        int num_kv_heads,
                        int head_dim,
                        int max_seq_len)
{
    (void)layer;
    if (!kv_cache_k || !kv_cache_v || !k || !v) {
        return;
    }
    if (num_kv_heads <= 0 || pos < 0 || head_dim <= 0 || max_seq_len <= 0) {
        return;
    }
    if (pos >= max_seq_len) {
        return;
    }
    const size_t head_stride = (size_t)max_seq_len * (size_t)head_dim;
    const size_t token_stride = (size_t)head_dim;
    for (int h = 0; h < num_kv_heads; ++h) {
        const float *k_src = k + (size_t)h * token_stride;
        const float *v_src = v + (size_t)h * token_stride;
        uint16_t *k_dst = kv_cache_k + (size_t)h * head_stride + (size_t)pos * token_stride;
        uint16_t *v_dst = kv_cache_v + (size_t)h * head_stride + (size_t)pos * token_stride;
        ck_local_fp32_to_fp16_row(k_src, k_dst, head_dim);
        ck_local_fp32_to_fp16_row(v_src, v_dst, head_dim);
    }
}

Qwen3-VL template marks decode KV cache as FP16

json

{
  "rope_layout": "multi_section_1d",
  "rope_type": "mrope",
  "qk_norm": true,
  "kv_layout": "layer_major_kv_cache",
  "decode_kv_cache_dtype": "fp16",
  "attn_variant": "dense"
}

Llama template binds FP16-aware prefill and decode attention kernels

json

{
  "attention_contract": {
    "rope_layout": "pairwise",
    "rope_type": "rope",
    "qk_norm": false,
    "kv_layout": "layer_major_kv_cache",
    "attn_variant": "dense",
    "train_runtime_contract": {
      "saved_tensor_kernel_overrides": {
        "attn_weights": "attention_forward_causal_head_major_gqa_exact"
      }
    }
  },
  "kernels": {
    "rope_qk": "rope_forward_qk_pairwise",
    "attn_prefill": "attention_forward_causal_head_major_gqa_flash_strided_f16kv",
    "attn_decode": "attention_forward_decode_head_major_gqa_flash_f16kv"
  }
}

Kernel registry entry for the FP16-rounded decode attention path

json

{
  "id": "attention_forward_decode_head_major_gqa_flash_f16kv",
  "variant": "decode_head_major_gqa_flash_f16kv",
  "inputs": [
    {
      "name": "q_token",
      "dtype": "fp32",
      "shape": [
        "num_heads",
        "head_dim"
      ],
      "desc": "Query for single token [num_heads, head_dim]"
    },
    {
      "name": "k_cache",
      "dtype": "fp32",
      "shape": [
        "num_kv_heads",
        "cache_capacity",
        "head_dim"
      ],
      "desc": "K cache [num_kv_heads, max_seq_len, head_dim]"
    },
    {
      "name": "v_cache",
      "dtype": "fp32",
      "shape": [
        "num_kv_heads",
        "cache_capacity",
        "head_dim"
      ],
      "desc": "V cache [num_kv_heads, max_seq_len, head_dim]"
    }
  ],
  "dims": [
    "num_heads",
    "num_kv_heads",
    "kv_tokens",
    "cache_capacity",
    "head_dim",
    "aligned_head_dim"
  ],
  "notes": "Decode flash attention variant that rounds K/V through FP16 to match llama.cpp flash-attn input handling."
}
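
The registry note says this variant rounds K/V through FP16 while the declared input dtype stays fp32. As a rough illustration of what that rounding does to individual values, here is a minimal sketch using only Python's standard `struct` module, whose `"e"` format is IEEE 754 half precision. The helper names `round_through_fp16` and `fp16_roundtrip_error` are hypothetical and are not part of CKE; the real kernel does this in C.

```python
import struct


def round_through_fp16(x: float) -> float:
    """Round a value to the nearest IEEE 754 half-precision number and widen it back."""
    # Packing with "e" performs the rounding; unpacking returns a Python float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_roundtrip_error(values):
    """Largest absolute error introduced by the FP16 round trip over a set of values."""
    return max(abs(v - round_through_fp16(v)) for v in values)


if __name__ == "__main__":
    sample = [1.0, 0.1, -0.3333, 3.14159, 1e-3]
    for v in sample:
        print(f"{v:>10.6f} -> {round_through_fp16(v):.10f}")
    print("max abs error:", fp16_roundtrip_error(sample))
```

Values that are exactly representable in half precision, such as 1.0 or 0.5, survive unchanged. Everything else picks up a small rounding error, which is the numerical risk the FP16 path accepts.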

Ephemeral Context — When You Don't Need KV Cache

Not every attention block wants a persistent history. Vision encoders are the clean counterexample. They process a fixed image token set, apply bidirectional attention once, produce embeddings, and stop.

That is why CKE templates for vision encoders use ephemeral_full_context instead of layer_major_kv_cache. K and V still exist during the computation, but they are not carried across future decode steps because there are no future decode steps inside the vision encoder.

The distinction is architectural, not cosmetic. Decoder attention is causal and persistent. Vision attention is bidirectional and disposable. The template system captures that difference in one field and the rest of the pipeline inherits it.

This is one of the quiet strengths of the template system from the IR pipeline post. The same compiler can build persistent autoregressive caches and zero-persistence vision passes because the memory contract is part of the model specification. Ephemeral attention is the statement “compute K/V, use K/V, forget K/V.” KV cache is the opposite: “compute once, keep until the conversation ends or the window evicts it.”

Qwen3-VL vision encoder attention contract — no persistent cache

json

{
  "rope_layout": "multi_section_2d",
  "rope_mode": "vision",
  "position_encoding": "absolute_2d",
  "kv_layout": "ephemeral_full_context",
  "attn_variant": "dense_bidirectional",
  "causal": false
}

SigLIP ViT template also uses ephemeral_full_context

json

{
  "rope_layout": "none",
  "position_encoding": "absolute_2d",
  "kv_layout": "ephemeral_full_context",
  "attn_variant": "dense_bidirectional",
  "causal": false
}

Decoder versus vision memory behavior in one sketch

text

decoder:
  causal = true
  kv_layout = layer_major_kv_cache
  state persists across tokens
vision encoder:
  causal = false
  kv_layout = ephemeral_full_context
  state dies after the pass
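
One way to picture how a single template field can drive memory planning is a sizing function that branches on kv_layout. This is a minimal sketch, not CKE's planner: the function name `persistent_kv_bytes` and the `dims` dictionary are assumptions for illustration, and the dimensions are the Qwen3-0.6B values from the layout map shape [28, 2, 8, 1024, 128].

```python
def persistent_kv_bytes(contract: dict, dims: dict, bytes_per_elem: int = 4) -> int:
    """Bytes of cross-token KV state implied by an attention contract (hypothetical helper)."""
    layout = contract.get("kv_layout")
    if layout == "ephemeral_full_context":
        # K/V live only inside the pass, so nothing persists across tokens.
        return 0
    if layout == "layer_major_kv_cache":
        # [num_layers, 2 (K and V), num_kv_heads, context_len, head_dim]
        return (dims["num_layers"] * 2 * dims["num_kv_heads"]
                * dims["context_len"] * dims["head_dim"] * bytes_per_elem)
    raise ValueError(f"unknown kv_layout: {layout!r}")


decoder = {"kv_layout": "layer_major_kv_cache", "causal": True}
vision = {"kv_layout": "ephemeral_full_context", "causal": False}
qwen3_dims = {"num_layers": 28, "num_kv_heads": 8, "context_len": 1024, "head_dim": 128}

if __name__ == "__main__":
    print("decoder:", persistent_kv_bytes(decoder, qwen3_dims), "bytes")
    print("vision: ", persistent_kv_bytes(vision, qwen3_dims), "bytes")
```

The decoder contract yields the 234,881,024 bytes seen in the layout map, and the vision contract yields zero persistent bytes, even though K and V are still computed inside the pass.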

Prefill vs Decode — Two Different KV Cache Patterns

Prefill and decode both populate the same cache, but they write to it in different patterns. Prefill processes many prompt tokens at once, so it prefers bulk operations. Decode handles one token, so it prefers one-position appends.

CKE reflects that difference in the IR builder. After rope_qk in decode mode, it auto-inserts kv_cache_store and rewires attention to a decode kernel that reads k_cache and v_cache. In prefill mode, it inserts a kv_cache_batch_copy op after attention, and codegen emits two bulk memcpy calls per KV head, one for K and one for V.

This split is not a micro-optimization. It is a statement about execution shape. Batch prefill wants contiguous token blocks. Single-token decode wants a precise append at pos.

Notice where the intelligence lives. The IR builder decides which ops should exist. Codegen only writes the already-decided calls. That is exactly the compiler split from the IR pipeline post applied to cache behavior. The same persistent buffer gets filled in two distinct ways: bulk-copy for prefill, single-token append for decode.
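
To make the "IR builder decides, codegen writes" split concrete, here is a toy version of the decode pass run on a two-op list. It is a simplified sketch that keeps only the fields needed to show the behavior. The function name `insert_decode_cache_ops` and the placeholder prefill kernel name are hypothetical, and the real pass in build_ir_v7.py handles more cases, such as sliding attention and forced regular kernels.

```python
DEFAULT_DECODE_KERNEL = "attention_forward_decode_head_major_gqa_flash"


def insert_decode_cache_ops(lowered_ops, decode_kernel=DEFAULT_DECODE_KERNEL):
    """Toy decode pass: add kv_cache_store after rope_qk and rewire attention inputs."""
    final_ops = []
    for src in lowered_ops:
        op = dict(src)                              # copy so the caller's list is untouched
        op["inputs"] = dict(src.get("inputs", {}))
        final_ops.append(op)
        layer = op["layer"]
        if op["op"] == "rope_qk":
            final_ops.append({
                "op": "kv_cache_store",
                "kernel": "kv_cache_store",
                "layer": layer,
                "inputs": {"k": {"type": "scratch", "source": "k_scratch"},
                           "v": {"type": "scratch", "source": "v_scratch"}},
                "outputs": {
                    "kv_cache_k": {"type": "kv_cache", "buffer": f"kv_cache_k_L{layer}"},
                    "kv_cache_v": {"type": "kv_cache", "buffer": f"kv_cache_v_L{layer}"},
                },
                "_auto_inserted": True,
            })
        elif op["op"] == "attn":
            # Attention now reads the persistent cache instead of scratch K/V.
            op["kernel"] = decode_kernel
            op["inputs"]["k_cache"] = {"type": "kv_cache", "source": f"kv_cache_k_L{layer}"}
            op["inputs"]["v_cache"] = {"type": "kv_cache", "source": f"kv_cache_v_L{layer}"}
            op["inputs"].pop("k", None)
            op["inputs"].pop("v", None)
    for idx, op in enumerate(final_ops):
        op["idx"] = idx                              # renumber after insertion
    return final_ops


toy = [
    {"op": "rope_qk", "layer": 0},
    {"op": "attn", "layer": 0, "kernel": "prefill_attention_placeholder",  # hypothetical name
     "inputs": {"k": {"source": "k_scratch"}, "v": {"source": "v_scratch"}}},
]

if __name__ == "__main__":
    for op in insert_decode_cache_ops(toy):
        print(op["idx"], op["op"], sorted(op.get("inputs", {})))
```

After the pass, the op list reads rope_qk, kv_cache_store, attn, and the attention op carries k_cache and v_cache inputs instead of scratch references. Codegen never has to make that decision.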

Prefill versus decode KV-cache population: block copy for prompt tokens versus single-position append during generation.

Prefill versus decode in one compact comparison

text

prefill:
  GEMM over all prompt tokens
  compute all K and V in blocks
  bulk-copy into cache with kv_cache_batch_copy
decode:
  GEMV for one new token
  compute one K and one V
  append with kv_cache_store
  scan cache with decode attention
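
The two population patterns in this comparison can be simulated on a flat list that mimics one layer's head-major K cache. This is an illustrative sketch with tiny dimensions: the function names `prefill_copy` and `decode_store` are stand-ins chosen here, not the CKE kernels, and the real code works on float buffers in C.

```python
def make_cache(num_kv_heads, capacity, head_dim):
    """Flat head-major cache [num_kv_heads, capacity, head_dim], zero-initialized."""
    return [0.0] * (num_kv_heads * capacity * head_dim)


def prefill_copy(cache, scratch, num_kv_heads, capacity, head_dim, num_tokens):
    """Bulk path: scratch is compact [Hkv, num_tokens, D]; cache rows are strided by capacity."""
    block = num_tokens * head_dim
    for h in range(num_kv_heads):
        dst = h * capacity * head_dim        # start of head h in the cache
        src = h * block                      # start of head h in compact scratch
        cache[dst:dst + block] = scratch[src:src + block]


def decode_store(cache, token, num_kv_heads, capacity, head_dim, pos):
    """Append path: token is [Hkv, D]; write one row per head at position pos."""
    for h in range(num_kv_heads):
        dst = (h * capacity + pos) * head_dim
        cache[dst:dst + head_dim] = token[h * head_dim:(h + 1) * head_dim]


if __name__ == "__main__":
    Hkv, cap, D = 2, 4, 2
    cache = make_cache(Hkv, cap, D)
    prompt_k = [float(x) for x in range(1, 9)]       # 2 heads x 2 tokens x 2 dims
    prefill_copy(cache, prompt_k, Hkv, cap, D, num_tokens=2)
    decode_store(cache, [9.0, 10.0, 11.0, 12.0], Hkv, cap, D, pos=2)
    print(cache)
```

The prefill call moves one contiguous block per head, while the decode call touches only head_dim elements per head at the current position. Both write into the same strided buffer.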

build_ir_v7.py — decode auto-inserts kv_cache_store after rope_qk

python

    for i, op in enumerate(lowered_ops):
        final_ops.append(op)
        if mode == "decode":
            # After rope_qk, insert kv_cache_store
            if op["op"] == "rope_qk":
                layer = op["layer"]
                kv_store_op = {
                    "idx": len(final_ops),  # Will be renumbered
                    "kernel": "kv_cache_store",
                    "op": "kv_cache_store",
                    "layer": layer,
                    "section": op["section"],
                    "function": "kv_cache_store",
                    "weights": {},
                    "inputs": {
                        "k": {"type": "scratch", "source": "k_scratch"},
                        "v": {"type": "scratch", "source": "v_scratch"},
                    },
                    "outputs": {
                        "kv_cache_k": {"type": "kv_cache", "buffer": f"kv_cache_k_L{layer}"},
                        "kv_cache_v": {"type": "kv_cache", "buffer": f"kv_cache_v_L{layer}"},
                    },
                    "scratch": [],
                    "_auto_inserted": True,
                }
                final_ops.append(kv_store_op)
                kv_store_count += 1
            # For decode mode, update attention ops to use decode kernel
            if op["op"] in ("attn", "attn_sliding") and "attention" in op["kernel"]:
                # Switch to decode attention kernel (sliding vs non-sliding)
                if op["op"] == "attn_sliding":
                    decode_kernel = template_kernels.get("attn_sliding_decode") or "attention_forward_decode_head_major_gqa_flash_sliding"
                else:
                    if force_decode_attn_regular:
                        decode_kernel = "attention_forward_decode_head_major_gqa_regular"

build_ir_v7.py — decode attention is rewired to read k_cache and v_cache

python

                    else:
                        decode_kernel = template_kernels.get("attn_decode") or "attention_forward_decode_head_major_gqa_flash"
                op["kernel"] = decode_kernel
                op["function"] = decode_kernel
                # Update inputs to use KV cache instead of scratch
                op.setdefault("inputs", {})
                op["inputs"]["k_cache"] = {"type": "kv_cache", "source": f"kv_cache_k_L{op['layer']}"}
                op["inputs"]["v_cache"] = {"type": "kv_cache", "source": f"kv_cache_v_L{op['layer']}"}
                # Remove scratch K/V references if present
                op["inputs"].pop("k", None)
                op["inputs"].pop("v", None)
        elif mode == "prefill":
            # For prefill: after q_proj/k_proj/v_proj, insert transpose from [T, H*D] to [H, T, D]
            # GEMM outputs token-major but attention expects head-major
            if op["op"] in ("q_proj", "split_q_gate"):
                layer = op["layer"]
                transpose_q_op = {
                    "idx": len(final_ops),
                    "kernel": "transpose_qkv_to_head_major",
                    "op": "transpose_qkv_to_head_major",
                    "layer": layer,
                    "section": op["section"],
                    "function": "transpose_inplace",

build_ir_v7.py and codegen_prefill_v7.py — prefill inserts kv_cache_batch_copy

python

            # For prefill: after attention, transpose output from head-major [H, T, D] to token-major [T, H*D]
            # Then insert kv_cache_batch_copy to copy K/V from scratch to cache
            if op["op"] in ("attn", "attn_sliding"):
                layer = op["layer"]
                # First: transpose attention output from head-major to token-major
                transpose_attn_out_op = {
                    "idx": len(final_ops),
                    "kernel": "transpose_attn_out_to_token_major",
                    "op": "transpose_attn_out_to_token_major",
                    "layer": layer,
                    "section": op["section"],
                    "function": "transpose_inplace",
                    "weights": {},
                    "inputs": {"buf": {"type": "scratch", "source": "attn_scratch"}},
                    "outputs": {"buf": {"type": "scratch", "buffer": "attn_scratch"}},
                    "scratch": [],
                    "_auto_inserted": True,
                }
                final_ops.append(transpose_attn_out_op)
                # Second: kv_cache_batch_copy
                # Contract check: validate this op against runtime_invariants:
                # _kv_copy_bytes must exist and match (num_kv_heads * head_dim * seq_len * sizeof(fp32)).
                kv_batch_copy_op = {
                    "idx": len(final_ops),
                    "kernel": "kv_cache_batch_copy",
                    "op": "kv_cache_batch_copy",
                    "layer": layer,
                    "section": op["section"],
                    "function": "kv_cache_batch_copy",  # Codegen emits two memcpy calls (K and V)
                    "weights": {},
                    "inputs": {
                        "k_src": {"type": "scratch", "source": "k_scratch"},
                        "v_src": {"type": "scratch", "source": "v_scratch"},
                    },
                    "outputs": {
                        "k_dst": {"type": "kv_cache", "buffer": f"kv_cache_k_L{layer}"},
                        "v_dst": {"type": "kv_cache", "buffer": f"kv_cache_v_L{layer}"},
                    },
                    "scratch": [],
                    "params": {
                        "num_kv_heads": int(config.get("num_kv_heads", 1)),
                        "head_dim": int(config.get("head_dim", 1)),

Generated prefill emitter — two memcpy calls into the persistent cache

python

    if op_type == "kv_cache_batch_copy":
        # Copy K/V from scratch (head-major after transpose) to KV cache
        # Scratch layout: [num_kv_heads, num_tokens, head_dim] (compact, head-major)
        # KV cache layout: [num_kv_heads, max_seq_len, head_dim] (with stride, head-major)
        layer = op.get("layer", 0)
        num_kv_heads = config.get("num_kv_heads", 2)
        head_dim = config.get("head_dim", 64)
        context_len = config.get("context_len", config.get("context_length", 1024))
        return f"""    /* Op {seq_idx}: kv_cache_batch_copy layer={layer} */
    /* Copy K/V from head-major scratch to KV cache for subsequent decode */
    {{
        const int Hkv = {num_kv_heads};
        const int D = {head_dim};
        const int cache_stride = {context_len};
        float *k_scratch = (float*)(model->bump + A_K_SCRATCH);
        float *v_scratch = (float*)(model->bump + A_V_SCRATCH);
        float *kv_cache = (float*)model->kv_cache;
        for (int h = 0; h < Hkv; h++) {{
            /* K: copy from scratch[h, 0:num_tokens, :] to cache[h, 0:num_tokens, :] */
            /* Scratch is compact: stride = num_tokens, Cache has stride = cache_stride */
            memcpy(
                kv_cache + ({layer}*2)*Hkv*cache_stride*D + h*cache_stride*D,
                k_scratch + h*num_tokens*D,
                (size_t)num_tokens * D * sizeof(float)
            );
            /* V: copy from scratch[h, 0:num_tokens, :] to cache[h, 0:num_tokens, :] */
            memcpy(
                kv_cache + ({layer}*2+1)*Hkv*cache_stride*D + h*cache_stride*D,
                v_scratch + h*num_tokens*D,
                (size_t)num_tokens * D * sizeof(float)
            );
        }}
    }}"""

Sliding Window Attention — KV Cache Eviction

A raw KV cache grows linearly with context length. Sliding-window attention is the mechanism that stops that growth from being unbounded in some layers. Instead of preserving every token forever, a sliding layer only needs the most recent window of positions.

Gemma4 makes this explicit in CKE. Its template declares a hybrid_sliding_attention variant and names separate kernels for full-context and sliding-context behavior. Some layers keep full history. Some layers evict old tokens behind a moving window.

That hybrid design is a modern compromise. Full-context layers preserve global recall where it matters. Sliding layers cap memory and bandwidth where locality is enough. The systems effect is bounded cache growth in part of the network.

Sliding window attention is really controlled forgetting. The cache does not vanish; it becomes a ring or bounded history for the layers that can tolerate eviction. For a sliding layer with window W, KV memory scales with W instead of total sequence length T. That is the eviction lever for long-context models.

Gemma4 attention contract names the hybrid sliding-window policy keys

json

{
  "rope_layout": "split",
  "rope_type": "rope",
  "qk_norm": true,
  "kv_layout": "layer_major_kv_cache",
  "attn_variant": "hybrid_sliding_attention",
  "layer_policy_config_key": "layer_attention_plan",
  "layer_kind_config_key": "layer_kinds",
  "kv_policy_config_key": "layer_kv_policy",
  "kv_source_config_key": "layer_kv_source",
  "sliding_window_config_key": "layer_sliding_window",
  "rope_kind_config_key": "layer_rope_kind"
}

Gemma4 kernel bindings split full-context and sliding decode paths

json

{
  "rope_qk": "rope_forward_qk_gemma4",
  "rope_init": "rope_precompute_cache_split",
  "attn": "attention_forward_causal_head_major_gqa_flash_strided_gemma4",
  "attn_sliding": "attention_forward_causal_head_major_gqa_flash_strided_sliding_gemma4",
  "attn_decode": "attention_forward_decode_head_major_gqa_flash_gemma4",
  "attn_sliding_decode": "attention_forward_decode_head_major_gqa_flash_sliding_gemma4"
}

Sliding-window eviction sketch

python

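window = 4096
for t in range(total_tokens):
    slot = t % window  # ring-buffer slot for absolute position t
    kv_cache_k[slot] = k_t
    kv_cache_v[slot] = v_t
    begin = max(0, t - window + 1)
    # Map absolute positions back to ring slots; slicing by absolute index would miss wrapped entries.
    slots = [p % window for p in range(begin, t + 1)]
    attend(q_t, [kv_cache_k[s] for s in slots], [kv_cache_v[s] for s in slots])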
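
For readers who want to run the idea, here is a small self-contained ring-buffer version. It is a sketch under simple assumptions: entries are arbitrary Python objects rather than float rows, and the class name `SlidingKVCache` is hypothetical. The point it demonstrates is that storage stays at W slots, and absolute positions have to be mapped back to ring slots before attention reads them.

```python
class SlidingKVCache:
    """Bounded K/V history for one sliding layer, stored as a ring of `window` slots."""

    def __init__(self, window: int):
        self.window = window
        self.k = [None] * window
        self.v = [None] * window
        self.count = 0                      # total tokens appended so far

    def append(self, k_t, v_t):
        slot = self.count % self.window     # the oldest entry is overwritten once full
        self.k[slot] = k_t
        self.v[slot] = v_t
        self.count += 1

    def visible(self):
        """Return K and V for the most recent `window` positions, oldest first."""
        begin = max(0, self.count - self.window)
        slots = [p % self.window for p in range(begin, self.count)]
        return [self.k[s] for s in slots], [self.v[s] for s in slots]


if __name__ == "__main__":
    cache = SlidingKVCache(window=4)
    for t in range(10):
        cache.append(f"k{t}", f"v{t}")
    print(cache.visible())
```

After ten appends with a window of four, only positions 6 through 9 remain visible, and the storage never grew beyond four slots per side.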

Memory Planning Integration — How the IR Builder Sizes KV Cache

The most important engineering fact about CKE's KV cache is that it is planned, not allocated on the fly. The builder computes its exact size from model metadata, assigns it a physical slot in the activation arena, and emits that decision into both the layout artifact and the generated C.

That means the cache is not competing with scratch buffers at runtime. Scratch buffers are reused because their lifetimes do not overlap. KV cache is persistent across tokens, so it gets its own dedicated region.

The layout map from the Qwen3 run shows the result clearly. kv_cache begins at offset 0x000000C05000 and occupies 234,881,024 bytes. It sits after residual and before rope_cache inside the activation arena.

This is deterministic memory planning in the literal sense. The address is known before the generated library is compiled. The generated file then bakes in matching constants like KV_CACHE_SIZE and A_KV_CACHE. Persistent buffers are a compile-time decision in CKE. That is why there is no allocator drama during token generation.

Deterministic activation arena with kv_cache placed between residual and rope_cache in the Qwen3-0.6B decode layout.

build_ir_v7.py — exact size formula for the persistent cache buffer

python

    # KV cache + RoPE
    kv_per_layer = num_kv_heads * context_len * head_dim * 4
    total_kv_size = num_layers * 2 * kv_per_layer
    add("kv_cache", total_kv_size, f"[{num_layers}, 2, {num_kv_heads}, {context_len}, {head_dim}]")

build_ir_v7.py — per-layer offset helper for kv_cache_k and kv_cache_v

python

    def kv_layer_offsets(layer: int) -> Optional[Tuple[int, int]]:
        kv_buf = activation_buffers.get("kv_cache")
        if not kv_buf or not context_len or not num_kv_heads or not head_dim:
            return None
        kv_per_layer = num_kv_heads * context_len * head_dim * 4
        base = kv_buf["offset"] + layer * 2 * kv_per_layer
        return base, base + kv_per_layer
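
As a worked check of that helper, the same arithmetic can be run with the Qwen3-0.6B numbers from the layout map. This standalone sketch hard-codes the base offset and dimensions as constants instead of reading `activation_buffers`, which is a simplification for illustration.

```python
from typing import Tuple

KV_BASE = 0x000000C05000                  # kv_cache offset from layout_decode.map
NUM_LAYERS, NUM_KV_HEADS, CONTEXT_LEN, HEAD_DIM = 28, 8, 1024, 128
BYTES_PER_ELEM = 4                        # fp32


def kv_layer_offsets(layer: int) -> Tuple[int, int]:
    """Byte offsets of the K and V regions for one layer inside the activation arena."""
    kv_per_layer = NUM_KV_HEADS * CONTEXT_LEN * HEAD_DIM * BYTES_PER_ELEM
    base = KV_BASE + layer * 2 * kv_per_layer
    return base, base + kv_per_layer


if __name__ == "__main__":
    for layer in (0, 1, NUM_LAYERS - 1):
        k_off, v_off = kv_layer_offsets(layer)
        print(f"layer {layer:2d}: K at 0x{k_off:012X}, V at 0x{v_off:012X}")
```

Each side is 4 MiB, so layer 0 has K at 0xC05000 and V at 0x1005000. The last layer's V region ends exactly at 0xEC05000, which is where rope_cache begins in the map.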

layout_decode.map — activation buffer table from the real run artifact

text

Offset         End            Size (bytes)               Buffer                   Shape                         
------------------------------------------------------------------------------------------------------------------------
0x000000000000 0x000000004000       16,384  (  16.00 KB)  text_input               [16384]                       
0x000000004000 0x000000005000        4,096  (   4.00 KB)  token_ids                [1024]                        
0x000000005000 0x000000405000    4,194,304  (   4.00 MB)  embedded_input           [1024, 1024]                  
0x000000405000 0x000000805000    4,194,304  (   4.00 MB)  layer_input              [1024, max(1024, Q8_K(3072))] 
0x000000805000 0x000000C05000    4,194,304  (   4.00 MB)  residual                 [1024, 1024]                  
0x000000C05000 0x00000EC05000  234,881,024  ( 224.00 MB)  kv_cache                 [28, 2, 8, 1024, 128]         
0x00000EC05000 0x00000EC85000      524,288  ( 512.00 KB)  rope_cache               [2, 1024, 64]                 
0x00000EC85000 0x00000F485000    8,388,608  (   8.00 MB)  q_scratch                [16, 1024, 128]               
0x00000F485000 0x00000F885000    4,194,304  (   4.00 MB)  k_scratch                [8, 1024, 128]                
0x00000F885000 0x00000FC85000    4,194,304  (   4.00 MB)  v_scratch                [8, 1024, 128]                
0x00000FC85000 0x000010C85000   16,777,216  (  16.00 MB)  attn_q_gate_packed       [1024, 4096]                  
0x000010C85000 0x000011485000    8,388,608  (   8.00 MB)  attn_gate                [1024, 2048]                  
0x000011485000 0x000011C85000    8,388,608  (   8.00 MB)  attn_scratch             [16, 1024, 128]               
0x000011C85000 0x000014485000   41,943,040  (  40.00 MB)  mlp_scratch              [max(1024*6144, fused_attn, geglu_bf16)]
0x000014485000 0x000014885000    4,194,304  (   4.00 MB)  layer_output             [1024, 1024]                  
0x000014885000 0x000039A05000  622,329,856  ( 593.50 MB)  logits                   [1024, 151936]
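
The offsets in this table are consistent with simple sequential placement: each buffer starts where the previous one ends. The sketch below reproduces the first few rows that way. It is an assumption that plain bump placement is all that is happening here; the real planner may apply alignment or other rules that are not visible in this artifact, and `plan_arena` is a hypothetical name.

```python
def plan_arena(buffers, base=0):
    """Assign each (name, size) buffer the next free offset; return offsets and the end."""
    offsets = {}
    cursor = base
    for name, size in buffers:
        offsets[name] = cursor
        cursor += size
    return offsets, cursor


MIB = 1024 * 1024
buffers = [
    ("text_input", 16_384),
    ("token_ids", 4_096),
    ("embedded_input", 4 * MIB),
    ("layer_input", 4 * MIB),
    ("residual", 4 * MIB),
    ("kv_cache", 28 * 2 * 8 * 1024 * 128 * 4),   # 234,881,024 bytes
    ("rope_cache", 524_288),
]

if __name__ == "__main__":
    offsets, end = plan_arena(buffers)
    for name, _ in buffers:
        print(f"0x{offsets[name]:012X}  {name}")
    print(f"0x{end:012X}  (next free)")
```

Because every size is known from model metadata, the kv_cache offset falls out as a constant before any C is generated.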

model_v7.c — KV_CACHE_SIZE in the generated header

/* Memory sizes */
#define WEIGHTS_SIZE 639587338ULL
#define ACTIVATIONS_SIZE 966807552ULL
#define KV_CACHE_SIZE 234881024ULL

model_v7.c — reset and enable behavior in the generated Qwen3 artifact

CK_EXPORT void ck_model_kv_cache_reset(void) {
    if (!g_model) return;
    memset(g_model->kv_cache, 0, KV_CACHE_SIZE);
    g_model->pos = 0;
}
CK_EXPORT int ck_model_kv_cache_enable(int capacity) {
    /* KV cache is always enabled in v7 */
    (void)capacity;
    return 0;
}

KV Cache as the Scaling Bottleneck

Once context length gets large, KV cache becomes the dominant per-request state. The formula is only linear in context, but linear growth across tens or hundreds of thousands of tokens is still brutal. Add concurrency and the deployment budget turns into a memory-management problem immediately.

This is also why decode remains memory-bound. Every token reads a long history of cached K and V values, and each byte participates in relatively little arithmetic. The roofline result from the CPU performance post — arithmetic intensity around 0.031 FLOPs/byte — is exactly the signature you expect from this streaming behavior.

Serving systems respond with a hierarchy of tricks. Architectural tricks like GQA reduce the tensor itself. Numerical tricks like FP16 or FP8 reduce bytes per element. Algorithmic tricks like sliding windows bound history. Systems tricks like PagedAttention manage fragmentation and residency more intelligently across many users.

Technique	Typical savings	Tradeoff
GQA	architectural 2× or better	Fewer KV heads means shared K/V across query heads.
FP16 / FP8 KV	2× (FP16) or 4× (FP8) fewer bytes than FP32	Slight numerical risk, usually acceptable for cached activations.
Sliding window	bounds growth to W instead of full T	Old tokens are evicted from some layers.
Compression / pruning	workload-dependent	Approximate; can hurt quality or complicate kernels.
PagedAttention	better allocator efficiency	Mainly a serving-systems win, not a math win.

The bottleneck story is therefore stacked. First minimize the tensor with model architecture. Then minimize bytes per element. Then minimize how long old tokens stay relevant. Then manage what is left like a serving system instead of pretending it is a simple array.

The CPU performance post measured decode at roughly 0.031 FLOPs/byte. That number all but shouts “KV-cache streaming workload.”
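
The stacking can be made concrete with the sizing formula from build_ir_v7.py. In this sketch the 16 query heads and 8 KV heads come from the q_scratch and k_scratch shapes in the layout map. The full multi-head baseline and the 256-token window applied to every layer are hypothetical comparison points, not configurations this post measured.

```python
def kv_bytes(num_layers, kv_heads, tokens, head_dim, bytes_per_elem):
    """Total K+V bytes for a layer-major cache."""
    return num_layers * 2 * kv_heads * tokens * head_dim * bytes_per_elem


MIB = 1024 * 1024
LAYERS, Q_HEADS, KV_HEADS, CONTEXT, HEAD_DIM = 28, 16, 8, 1024, 128

steps = [
    ("hypothetical MHA, fp32", kv_bytes(LAYERS, Q_HEADS, CONTEXT, HEAD_DIM, 4)),
    ("GQA (8 KV heads), fp32", kv_bytes(LAYERS, KV_HEADS, CONTEXT, HEAD_DIM, 4)),
    ("GQA, fp16", kv_bytes(LAYERS, KV_HEADS, CONTEXT, HEAD_DIM, 2)),
    # Assumption for illustration: every layer slides with a 256-token window.
    ("GQA, fp16, window=256", kv_bytes(LAYERS, KV_HEADS, 256, HEAD_DIM, 2)),
]

if __name__ == "__main__":
    for label, size in steps:
        print(f"{label:<26} {size / MIB:8.1f} MiB")
```

Each lever multiplies with the others, which is why the order in the paragraph above reads like a pipeline rather than a menu.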

Why decode attention lands on the bandwidth side of the roofline

text

decode attention arithmetic intensity ≈ 0.031 FLOPs/byte
meaning:
  every decode token streams lots of cached K/V bytes
  each byte participates in very little arithmetic
  decode therefore sits on the bandwidth side of the roofline

A rough multi-user memory thought experiment

python

per_user_kv = 224 * 1024 * 1024
concurrent_users = 64
fleet_bytes = per_user_kv * concurrent_users
print(fleet_bytes / 1024 / 1024 / 1024)  # 14.0 GiB just for KV cache

PagedAttention exists because serving wants page-level KV management

python

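page_size = 16  # tokens per page, illustrative
page_table = {}  # (layer, head, page_id) -> physical page index
def physical_slot(layer, head, pos):
    page = page_table[(layer, head, pos // page_size)]
    return page * page_size + pos % page_size
# physical pages are recycled across requests when tokens expire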

What Silicon Vendors Should See

Silicon teams should read this cache design as a benchmark disguised as a runtime feature. The access pattern is known in advance, mostly sequential within a head, and repeated on every decode token. That makes it a near-perfect probe of memory bandwidth, prefetch quality, LLC usefulness, and latency hiding.

CKE is especially revealing because the buffer is statically sized and statically addressed. There is no allocator noise obscuring the profile. The machine sees a deterministic stream of reads and writes whose geometry was chosen at IR build time.

For a CPU vendor, that means the question becomes concrete. How many bytes per second can the platform sustain when a batch-1 decode kernel repeatedly walks these head-major histories? How much of one layer's 8 MiB K+V slice fits comfortably in shared cache before DRAM becomes unavoidable?

This is why DDR bandwidth keeps showing up in inference conversations. A faster FMA pipeline does not rescue a kernel whose limiting operation is “stream old cache lines back into attention.” Better memory systems do. KV cache access pattern is the decode-memory benchmark. The model just gives the benchmark semantic meaning.

How a vendor should mentally read the generated pointer math

for layer in range(NUM_LAYERS):
    k_ptr = kv_cache + (layer * 2 + 0) * layer_stride
    v_ptr = kv_cache + (layer * 2 + 1) * layer_stride
    stream_sequentially(k_ptr, v_ptr)
    # this is a bandwidth benchmark dressed up as attention

Cache-fit intuition for one Qwen3 layer

text

per_layer_side = 4 * 1024 * 1024   # 4 MiB K or 4 MiB V for one layer
per_layer_kv   = 8 * 1024 * 1024   # 8 MiB total per layer
# An L3 that can hold only a few such slices will accelerate only
# those hot layers; the rest spill to DRAM.
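
A quick way to turn that intuition into numbers is to divide a candidate L3 size by the per-layer slice. The L3 capacities below are illustrative assumptions, not figures for any specific product. The result is an upper bound, since weights and activations compete for the same cache, and the helper name `layers_resident` is hypothetical.

```python
MIB = 1024 * 1024
NUM_LAYERS = 28
PER_LAYER_KV = 2 * 8 * 1024 * 128 * 4      # K + V for one Qwen3-0.6B layer: 8 MiB


def layers_resident(l3_bytes: int, per_layer_kv: int = PER_LAYER_KV,
                    num_layers: int = NUM_LAYERS) -> int:
    """Upper bound on how many full per-layer K+V slices fit in a cache of l3_bytes."""
    return min(num_layers, l3_bytes // per_layer_kv)


if __name__ == "__main__":
    for l3_mib in (32, 96, 256):           # illustrative L3 sizes
        fit = layers_resident(l3_mib * MIB)
        print(f"L3 {l3_mib:3d} MiB: at most {fit:2d} of {NUM_LAYERS} layer slices resident")
```

With a full 1,024-token history, the whole cache is 224 MiB, so only a very large last-level cache holds every layer at once.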

Bandwidth thinking — more channels, more decode headroom

text

ARM Neoverse V3-class platforms push much higher aggregate DDR5 bandwidth.
If cached attention is mostly streaming K/V history,
then more channels translate directly into more headroom for batch-1 decode throughput.
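
To see why channels matter, it helps to estimate how long it takes just to stream the cached K/V for one decode token. The bandwidth values below are round illustrative assumptions, not measurements of any platform. The estimate also ignores weight traffic and cache hits, so treat it as a rough floor for the KV portion only.

```python
def kv_bytes_per_decode_token(num_layers, num_kv_heads, kv_tokens, head_dim,
                              bytes_per_elem=4):
    """Bytes of cached K and V read when one token attends over kv_tokens of history."""
    return num_layers * 2 * num_kv_heads * kv_tokens * head_dim * bytes_per_elem


def kv_stream_seconds(bytes_streamed, bandwidth_bytes_per_s):
    """Time to move bytes_streamed at a sustained bandwidth."""
    return bytes_streamed / bandwidth_bytes_per_s


if __name__ == "__main__":
    streamed = kv_bytes_per_decode_token(28, 8, 1024, 128)   # full 1,024-token history
    for gb_per_s in (50, 100, 400):                          # illustrative bandwidths
        ms = kv_stream_seconds(streamed, gb_per_s * 1e9) * 1e3
        print(f"{gb_per_s:4d} GB/s -> {ms:6.2f} ms per token for KV reads alone")
```

Doubling sustained bandwidth halves this floor, which is the sense in which more memory channels buy decode headroom.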

Conclusion — From O(T²) to O(T) with One Buffer

KV cache is the simplest big idea in autoregressive inference. Stop recomputing historical keys and values. Store them once. Read them many times.

CKE's contribution is not inventing that idea. It is handling the entire lifecycle deterministically: template declares the layout, IR builder sizes the buffer, memory planner places it, codegen emits fixed offsets, kernels write and read it, and optional FP16 or sliding-window variants change the tradeoff without changing the core story.

That is why this buffer sits at the center of so many adjacent posts. The flash-attention post explained how decode attention streams through cached history. The CPU performance post showed the bandwidth signature in real profiling data. The IR pipeline post showed the compiler pipeline that decides the buffer before C exists. KV cache is the piece that turns those three threads into one systems picture.

The enduring mental model is straightforward: KV cache converts a compute problem into a memory problem. Once you see that, most decode optimizations sort themselves into place.

The cache does not remove the need to read history, but it does remove the need to regenerate that history on every step. That is the decisive asymptotic win: O(T²) → O(T).

One-buffer summary

text

Without the cache:
  recompute historical K and V every step.
With the cache:
  append one token,
  read one deterministic buffer,
  trade wasted FLOPs for predictable memory traffic.
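
The asymptotic claim can be counted directly. This sketch tallies how many per-token K/V projections are computed over a generation of T tokens, assuming that without a cache step t recomputes K and V for all t tokens seen so far. The function names are illustrative.

```python
def kv_projections_without_cache(total_tokens: int) -> int:
    """Step t recomputes K/V for all t tokens so far: 1 + 2 + ... + T."""
    return total_tokens * (total_tokens + 1) // 2


def kv_projections_with_cache(total_tokens: int) -> int:
    """Each token's K/V is computed once and stored."""
    return total_tokens


if __name__ == "__main__":
    for T in (16, 1024, 32768):
        without = kv_projections_without_cache(T)
        with_cache = kv_projections_with_cache(T)
        print(f"T={T:6d}: {without:12d} vs {with_cache:6d}  ({without / with_cache:.1f}x)")
```

The ratio grows as (T + 1) / 2, so the longer the context, the more recomputation the buffer saves.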

CKE recap — sizing, layout, compression, eviction, deterministic offsets

text

CKE KV-cache checklist:
  - static sizing in build_ir_v7.py
  - layer_major_kv_cache layout
  - GQA-aware num_kv_heads accounting
  - FP32 and FP16 store paths
  - prefill bulk-copy and decode single-token append
  - sliding-window compatibility for hybrid models
  - generated C with fixed pointer arithmetic

Continue with the adjacent posts in the series

text

Related posts:
- https://www.shivasnotes.com/blog/5914/flash-attention-on-cpu-online-softmax-cache-discipline-and-cke-kernels
- https://www.shivasnotes.com/blog/5915/cpu-performance-engineering-for-ai-rooflines-flamegraphs-and-vtune
- https://www.shivasnotes.com/blog/5917/v8-ir-pipeline-codegen-how-cke-hardens-pure-c-inference

Continue with Flash Attention on CPU, CPU Performance Engineering for AI, and The v8 IR Pipeline to see the algorithmic, profiling, and compiler sides of the same decode-memory story.

