CPU Performance Engineering for AI: Rooflines, Flamegraphs, VTune, and Perf Gates


This ShivasNotes deep dive is written for CPU silicon teams and systems engineers who need to understand not just whether an AI framework is fast, but why it is or isn't fast, and exactly where the bottleneck lives. C-Kernel-Engine doesn't just run inference — it generates a full performance observatory with roofline analysis, flamegraphs, VTune hotspots, cachegrind attribution, and automated perf gates. This post teaches what each tool measures, how to read it, and what it reveals about AI workloads on CPU. Video walkthroughs are published on youtube.com/@antshivrobotics.

The difference between a toy inference runtime and a serious CPU platform is not whether text appears on screen. It is whether the runtime can explain every cycle, every miss, every stall, every hot loop, and every regression threshold in plain artifacts a silicon team can inspect. The recent ShivasNotes CPU-kernel series built the groundwork: SIMD, NEON, quantization, and flash attention. This post is where those ideas are measured against the machine itself.

What this post covers

Sections 1 through 4 establish the language of CPU performance engineering: counters, IPC, caches, TLBs, and branch prediction. Sections 5 through 8 move into the four big observability tools: roofline, VTune, Advisor, and flamegraphs.

Sections 9 through 13 zoom into CKE’s own per-op profiler, heatmaps, Theory of Constraints dashboard, perf gate, and the IR Visualizer pipeline that fuses everything into one offline HTML report. Sections 14 and 15 close with what these measurements mean for AI on CPU and why this matters to silicon vendors.

Introduction — Performance Analysis Is the Real Moat

A surprising amount of AI software stops at the sentence “the model runs.” That sentence is operationally useful, but it is not engineering mastery. The real moat is knowing exactly why the model runs at its current speed, exactly which subsystem is constraining it, and exactly which change would move the ceiling.

Most AI frameworks still treat the CPU as a black box: launch the kernel, time the end-to-end request, and hope the compiler and hardware sort it out. C-Kernel-Engine does the opposite. It treats the CPU as a first-class target with explicit profiling hooks, structured summaries, and a report pipeline designed for post-mortem analysis.

That framing is what makes CKE interesting to silicon teams. The code does not merely say “AVX2 path exists.” It also says “here is the measured IPC, here is the cache miss rate, here is the roofline position, and here is the one function consuming three quarters of total time.” Six artifacts (perf stat, VTune, Advisor, flamegraph, cachegrind, and per-op profiling) all land in one report bundle.

On a single Qwen3-0.6B-Q8_0 run, CKE automatically materializes six profiling artifacts: perf_stat_summary.json, vtune_summary.json, advisor_summary.json, flamegraph_manifest.json, cachegrind_summary.json, and profile_summary.json. Those files are then ingested by the IR Visualizer’s Profile tab, which sits inside a larger 11-tab offline report totaling roughly 24,950 lines of HTML and JavaScript.

CKE profiling artifact inventory for one decode run

text

Model: Qwen3-0.6B-Q8_0
Host: Intel hybrid CPU, AVX2+FMA, 12 threads auto-detected
Artifacts emitted:
  perf_stat_summary.json
  vtune_summary.json
  advisor_summary.json
  flamegraph_manifest.json
  cachegrind_summary.json
  profile_summary.json
Visualizer aggregation target:
  ir_report.html
  Profile tab consumes all six sources
  Total visualizer size ≈ 24,950 lines HTML/JS
  Total tabs = 11

Observed profile run log for the live case study

text

Model   : Qwen3-0.6B-Q8_0
Kernel  : gemm_avx2
Threads : 12
Prefill:
  12 tokens
  387.6 ms
  31.0 tok/s
Decode:
  31 tokens
  1976.2 ms
  15.7 tok/s
  63.7 ms/tok

Pipeline that produces the final IR report

bash

make profile-v7-decode
make profile-v7-perf-stat
make profile-v7-flamegraph
make profile-v7-vtune
make profile-v7-advisor
make profile-v7-cachegrind
python3 version/v7/tools/open_ir_visualizer.py --generate --run <model-dir>

The Hardware Counters — What `perf stat` Measures

Hardware performance counters are small accounting registers inside the CPU’s Performance Monitoring Unit, or PMU. They count events the core really experienced: retired instructions, elapsed cycles, cache lookups, cache misses, branches, branch misses, TLB activity, and page faults. They are the machine telling you what happened, not a profiler guessing from symbols alone.

One of the most important derived metrics is IPC, instructions per cycle. The pipeline roughly moves through fetch → decode → execute → retire. If the machine retires 1.42 instructions per clock, as this run does, it means the core is doing meaningful work but still spending a visible fraction of time waiting on data or execution resources.

A rough engineering shorthand is useful here. IPC below 1.0 usually signals a stall-heavy workload. IPC above 2.0 is generally healthy utilization. IPC above 3.0 is excellent on ordinary scalar-plus-vector server/client code. Qwen3 decode lands at IPC = 1.42: not catastrophic, not brilliant, and exactly what you expect from memory-sensitive batch-1 inference.

Hybrid Intel parts complicate the story in a good way. The PMU can expose both cpu_core and cpu_atom event domains, so CKE can track P-core and E-core behavior separately. In this run the P-cores show lower IPC at 1.27 because they are handling heavier SIMD and memory pressure, while E-cores reach 1.61 on lighter work.

Counter	P-core value	E-core value	Derived metric
`instructions`	35.6B	15.0B	50.6B total retired instructions
`cycles`	28.0B	9.3B	35.6B total cycles observed
`IPC`	1.27	1.61	1.42 overall IPC
`cache-references`	0.69B	0.32B	1.01B total lookups
`cache-misses`	0.46B	0.19B	63.9% miss rate overall
`branches`	1.20B	0.57B	1.77B branches total
`branch-misses`	6.3M	3.0M	0.52% miss rate
`dTLB-loads`	6.7B	3.38B	10.08B total
`dTLB-load-misses`	7.7M	3.7M	0.11% load miss rate
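The per-domain columns of the table can be re-derived in a few lines. This is a minimal sketch rather than CKE code: `DomainCounters` and `classify_ipc` are hypothetical names, the inputs are the rounded values printed in the table, and the thresholds are the rough shorthand given earlier, with the label "moderate" for the 1.0 to 2.0 band being an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainCounters:
    """Rounded perf stat counters for one PMU event domain (from the table)."""
    instructions: float
    cycles: float
    branches: float
    branch_misses: float
    dtlb_loads: float
    dtlb_load_misses: float

    @property
    def ipc(self) -> float:
        return self.instructions / self.cycles

    @property
    def branch_miss_rate(self) -> float:
        return self.branch_misses / self.branches

    @property
    def dtlb_load_miss_rate(self) -> float:
        return self.dtlb_load_misses / self.dtlb_loads


def classify_ipc(ipc: float) -> str:
    """Rough shorthand from the text; the 'moderate' band name is an assumption."""
    if ipc < 1.0:
        return "stall-heavy"
    if ipc > 3.0:
        return "excellent"
    if ipc > 2.0:
        return "healthy"
    return "moderate"


# Values copied from the P-core and E-core columns of the table.
p_core = DomainCounters(35.6e9, 28.0e9, 1.20e9, 6.3e6, 6.7e9, 7.7e6)
e_core = DomainCounters(15.0e9, 9.3e9, 0.57e9, 3.0e6, 3.38e9, 3.7e6)

for name, dom in (("cpu_core", p_core), ("cpu_atom", e_core)):
    print(f"{name}: IPC={dom.ipc:.2f} ({classify_ipc(dom.ipc)}), "
          f"branch miss={dom.branch_miss_rate:.2%}, "
          f"dTLB load miss={dom.dtlb_load_miss_rate:.2%}")
```

Keeping the two domains as separate records mirrors how the PMU reports them, so a regression on one core type is not averaged away by the other.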

The perf stat event set wired into the v7 Makefile

bash

perf_events="cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,branches,branch-misses,stalled-cycles-frontend,stalled-cycles-backend,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,minor-faults,major-faults"
if perf list 2>/dev/null | grep -q "dtlb_load_misses.walk_completed"; then
    perf_events="$perf_events,cpu_core/dtlb_load_misses.walk_completed/,cpu_core/dtlb_store_misses.walk_completed/,cpu_core/itlb_misses.walk_completed/,cpu_core/dtlb_load_misses.stlb_hit/,cpu_core/itlb_misses.stlb_hit/"
fi
perf stat --all-user -e "$perf_events" \
    ./build/ck-cli-v7 <runtime> \
    --prompt "The quick brown fox" --max-tokens 32 --timing --quiet-output \
    2> build/ck_v7_perf_stat.txt

CSV-aware parser from perf_artifacts_v7.py

python

def parse_perf_stat_text(text: str) -> Dict[str, object]:
    counters: Dict[str, float] = {}
    notes: Dict[str, str] = {}
    elapsed_sec: Optional[float] = None
    rows = []
    line_re = re.compile(r"^\s*([<\w\.,-]+)\s+([A-Za-z0-9_\-\.\/:]+)(?:\s+(?:#\s*)?(.*))?$")
    elapsed_re = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s+seconds\s+time\s+elapsed")
    for line in text.splitlines():
        m_elapsed = elapsed_re.search(line)
        if m_elapsed:
            try:
                elapsed_sec = float(m_elapsed.group(1))
            except ValueError:
                pass
            continue
        m = line_re.match(line)
        if not m:
            parsed_csv = parse_perf_stat_csv_line(line)
            if parsed_csv is None:
                continue
            metric, value, note = parsed_csv
        else:
            raw_value, metric, note = m.groups()

Derived metrics emitted by perf_artifacts_v7.py

python

if inst is not None and cyc and cyc > 0:
    derived["ipc"] = inst / cyc
if cache_ref and cache_ref > 0 and cache_miss is not None:
    derived["cache_miss_rate"] = cache_miss / cache_ref
if branches and branches > 0 and branch_miss is not None:
    derived["branch_miss_rate"] = branch_miss / branches
if dtlb_loads and dtlb_loads > 0 and dtlb_load_miss is not None:
    derived["dtlb_load_miss_rate"] = dtlb_load_miss / dtlb_loads
if dtlb_stores and dtlb_stores > 0 and dtlb_store_miss is not None:
    derived["dtlb_store_miss_rate"] = dtlb_store_miss / dtlb_stores
if has_page_walks:
    derived["page_walks"] = page_walks
if minor_faults is not None and inst and inst > 0:
    derived["minor_faults_per_kinst"] = (float(minor_faults) * 1000.0) / float(inst)
if major_faults is not None and inst and inst > 0:
    derived["major_faults_per_kinst"] = (float(major_faults) * 1000.0) / float(inst)

Representative perf_stat_summary.json fields for the Qwen3 run

json

{
  "derived": {
    "ipc": 1.42,
    "cache_miss_rate": 0.639,
    "branch_miss_rate": 0.0052,
    "dtlb_load_miss_rate": 0.0011,
    "dtlb_store_miss_rate": 0.0468,
    "page_walks": 17200000.0
  },
  "counters": {
    "instructions": 50600000000.0,
    "cycles": 35600000000.0,
    "cache-references": 1010000000.0,
    "cache-misses": 646900000.0,
    "branches": 1770000000.0,
    "branch-misses": 9290000.0
  },
  "elapsed_seconds": 0.89
}
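One cheap sanity check on a summary like this is to recompute each derived ratio from the raw counters and compare. The sketch below is an assumption about how such a check could look, not part of `perf_artifacts_v7.py`; `check_derived` and `RATIOS` are hypothetical names, and the 2% tolerance is chosen only to absorb the rounding in the representative values above.

```python
import math

# Representative fields copied from the JSON above.
summary = {
    "derived": {"ipc": 1.42, "cache_miss_rate": 0.639, "branch_miss_rate": 0.0052},
    "counters": {
        "instructions": 50.6e9,
        "cycles": 35.6e9,
        "cache-references": 1.01e9,
        "cache-misses": 646.9e6,
        "branches": 1.77e9,
        "branch-misses": 9.29e6,
    },
}

# derived metric -> (numerator counter, denominator counter)
RATIOS = {
    "ipc": ("instructions", "cycles"),
    "cache_miss_rate": ("cache-misses", "cache-references"),
    "branch_miss_rate": ("branch-misses", "branches"),
}


def check_derived(doc: dict, rel_tol: float = 0.02) -> list:
    """Return a human-readable mismatch for each derived ratio that disagrees."""
    counters = doc.get("counters", {})
    reported = doc.get("derived", {})
    mismatches = []
    for name, (num_key, den_key) in RATIOS.items():
        num, den = counters.get(num_key), counters.get(den_key)
        if num is None or not den or name not in reported:
            continue  # not enough data to recompute this ratio
        recomputed = num / den
        if not math.isclose(recomputed, reported[name], rel_tol=rel_tol):
            mismatches.append(
                f"{name}: reported {reported[name]}, recomputed {recomputed:.4g}")
    return mismatches


print(check_derived(summary) or "derived metrics agree with counters")
```

A check of this kind is useful in a gate because it catches a parser that silently picked up the wrong event row.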

The Cache Hierarchy — Where Bytes Live and Die

Modern CPU performance is a geography problem. The closer the bytes are to the core, the cheaper the access. As a mental model, think L1 data cache at roughly 32 KB per core and about a nanosecond away, L2 around 1.25 MB and a few nanoseconds away, L3 shared across cores at tens of megabytes and perhaps 10–15 ns away, and DRAM far away at 50–100 ns or worse.

That latency ladder is why cache statistics matter so much. This run reports 1.01B cache references and 646.9M cache misses, a 63.9% miss rate. That number looks shocking until you remember what decode is doing: batch-1 GEMV streams weight rows through the machine once per token, so the working set behaves more like a read-once river than a hot reusable tile.

Qwen3-0.6B in Q8_0 is roughly a 640 MB weight footprint before metadata overhead. The shared L3 on a client Intel part is more like 18–30 MB. Streaming 640 MB through a 24 MB last-level cache once per token is not a “maybe miss” scenario. It is a guaranteed capacity miss story. Quantization helps because it shrinks the river. It does not magically turn decode into an L1-resident compute kernel.

Level	Typical capacity	Approx latency	Why it matters here
L1D	~32 KB per core	~1 ns	Only tiny hot vectors and loop state reliably live here.
L2	~1.25 MB per core	~4 ns	Good for blocked kernels, too small for decode weight streaming.
L3 / LLC	~18–30 MB shared	~10–15 ns	Acts as a pressure valve, not a home, for 640 MB of weights.
DRAM	System memory	~50–100 ns	The real wall for single-token decode.

The last-level-cache counters make the story more concrete. On the P-cores, LLC loads are 16.8M and LLC load misses are 6.5M, which is a 38.7% LLC miss rate. Those misses are the accesses that escape the chip-level cache and become real DRAM traffic.

The TLB story is gentler but still relevant. There are 10.08B dTLB loads and 11.4M dTLB load misses, only 0.11%, yet that still becomes 17.2M total page walks. When large weight tensors span many 4 KB pages, huge pages can reduce translation overhead even if the main bottleneck remains bandwidth.
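To make the huge-page argument concrete, the sketch below counts how many translations the weight footprint needs at each page size. It assumes the 640 MB figure can be read as 640 MiB and uses the standard x86-64 sizes of 4 KiB for base pages and 2 MiB for huge pages; `pages_needed` is a hypothetical helper name.

```python
def pages_needed(footprint_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map a footprint (ceiling division)."""
    return -(-footprint_bytes // page_bytes)


WEIGHT_BYTES = 640 * 1024 * 1024   # assumption: "640 MB" read as 640 MiB
BASE_PAGE = 4 * 1024               # 4 KiB base page
HUGE_PAGE = 2 * 1024 * 1024        # 2 MiB huge page on x86-64

base_pages = pages_needed(WEIGHT_BYTES, BASE_PAGE)
huge_pages = pages_needed(WEIGHT_BYTES, HUGE_PAGE)

print(f"4 KiB pages: {base_pages:,}")
print(f"2 MiB pages: {huge_pages:,}")
print(f"reduction  : {base_pages // huge_pages}x fewer translations to cache")
```

With a few hundred huge pages instead of well over a hundred thousand base pages, far more of the weight mapping can stay resident in the TLB. The bytes still have to come from DRAM either way.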

Cache hierarchy intuition for decode

text

Closest first:
  L1D  -> tiny, extremely fast, per-core
  L2   -> bigger, still local, per-core
  L3   -> shared, much larger, still on-die
  DRAM -> huge, slow, off-chip, beyond the cache hierarchy
Decode implication:
  weight rows stream from deeper levels
  reuse is low at batch = 1
  misses are not an accident
  misses are the workload shape

Why Q8_0 decode keeps missing cache

text

Qwen3-0.6B, Q8_0:
  ~0.6B parameters
  ~1 byte / weight  -> ~600 MB raw payload
  plus scales / metadata -> ~640 MB practical footprint
Client Intel LLC:
  ~24 MB shared cache
Per decode token:
  each row used once
  each layer streams its weights again
  640 MB >> 24 MB
Result:
  high capacity miss rate is expected

TLB and page-walk metrics from the case study

json

{
  "dTLB-loads": 10080000000.0,
  "dTLB-load-misses": 11400000.0,
  "dtlb_load_miss_rate": 0.0011,
  "dTLB-stores": 160000000.0,
  "dTLB-store-misses": 7490000.0,
  "dtlb_store_miss_rate": 0.0468,
  "page_walks": 17200000.0,
  "minor-faults": 394000.0,
  "major-faults": 16.0
}

Huge-page checks worth running on large-model hosts

bash

cat /sys/kernel/mm/transparent_hugepage/enabled
grep -E 'Huge|AnonHuge' /proc/meminfo
getconf PAGE_SIZE
# If the platform allows explicit huge pages (needs root; at the default 2 MB
# huge-page size on x86-64, 4096 pages reserves 8 GB of RAM):
sudo sysctl -w vm.nr_hugepages=4096
numactl --hardware

Cachegrind parser logic used to summarize memory behavior

python

def derive_metrics(totals: Dict[str, int]) -> Dict[str, float]:
    ir = float(totals.get("Ir", 0))
    dr = float(totals.get("Dr", 0))
    dw = float(totals.get("Dw", 0))
    d1mr = float(totals.get("D1mr", 0))
    d1mw = float(totals.get("D1mw", 0))
    llmr = float(totals.get("LLmr", 0))
    llmw = float(totals.get("LLmw", 0))
    data_refs = dr + dw
    d1_misses = d1mr + d1mw
    ll_misses = llmr + llmw
    out: Dict[str, float] = {}
    if data_refs > 0:
        out["d1_miss_rate"] = d1_misses / data_refs
        out["ll_miss_rate"] = ll_misses / data_refs
        out["ll_miss_given_d1_miss"] = (ll_misses / d1_misses) if d1_misses > 0 else 0.0
    return out

CPU cache hierarchy diagram showing L1, L2, L3, DRAM, TLBs, and why streamed LLM weights generate capacity misses during decode.

Branch Prediction — The Pipeline's Crystal Ball

Branches are the pipeline’s gamble on the future. When the CPU sees an if, a loop back-edge, or any control-flow fork, it predicts which path will be needed next so fetch and decode keep moving. A wrong guess flushes partially built work and can cost on the order of 15 cycles or more on a modern out-of-order core.

The good news is that well-structured numerical kernels are usually predictable. Inner SIMD loops tend to be regular, count-controlled, and free of data-dependent control flow. That is exactly what the CKE counters show: 1.77B branches and only 9.29M misses, for a branch miss rate of 0.52%.

This is one of the cleanest signals in the whole report. Branch prediction is not the bottleneck here. The hot loops are straight-line numeric code, so the machine is mostly losing time on data movement rather than speculative control flow. Modern CPUs often exceed 99% branch prediction accuracy on regular loops. A 0.52% miss rate is exactly the kind of number you want to see in vectorized inference code.
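A quick estimate shows why branch prediction can be ruled out. The sketch multiplies the measured miss count by a flush penalty and divides by total cycles. The 15-cycle figure is the rough penalty quoted above, the 20-cycle case is an added assumption for a more pessimistic core, and `mispredict_cycle_share` is a hypothetical helper name.

```python
BRANCH_MISSES = 9.29e6   # measured branch-misses for the run
TOTAL_CYCLES = 35.6e9    # measured cycles for the run


def mispredict_cycle_share(misses: float, cycles: float, penalty_cycles: float) -> float:
    """Fraction of all cycles lost to misprediction flushes (rough estimate)."""
    return (misses * penalty_cycles) / cycles


for penalty in (15, 20):  # 20 is an assumed pessimistic penalty
    share = mispredict_cycle_share(BRANCH_MISSES, TOTAL_CYCLES, penalty)
    print(f"penalty {penalty} cycles -> {share:.2%} of total cycles")
```

Under either penalty the estimate stays at roughly half a percent of cycles or less, so even perfect prediction would not visibly change decode speed.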

Branch metrics for the live Qwen3 decode run

json

{
  "branches": 1770000000.0,
  "branch-misses": 9290000.0,
  "branch_miss_rate": 0.0052,
  "interpretation": "excellent branch behavior"
}

Why SIMD-heavy inner loops usually branch well

text

for (block = 0; block < nb; ++block) {
    load bytes
    widen lanes
    multiply
    accumulate
    reduce
}
Control flow:
  predictable loop counter
  predictable exit test
  minimal data-dependent branching
Penalty avoided:
  fewer pipeline flushes
  better retirement continuity

Roofline Analysis — The Most Important Chart in Performance Engineering

If there is one chart every CPU performance engineer should know, it is the roofline model from Williams, Waterman, and Patterson (2009). It compresses a huge amount of machine behavior into two axes: arithmetic intensity on the X-axis and attainable performance on the Y-axis. Both are usually plotted on log scales because the meaningful range spans orders of magnitude.

Arithmetic intensity means FLOPs per byte of DRAM traffic. If a kernel does almost no arithmetic for each byte fetched, it lives on the left side of the chart and is limited by the slanted memory-bandwidth roof. If it does a great deal of arithmetic per byte, it moves rightward until it collides with the horizontal peak-compute roof.

The intersection of those two ceilings is the ridge point. Left of the ridge, more ALUs do not help because the cores are starved for data. Right of the ridge, more memory bandwidth does not help because the machine is already compute-saturated. For an AVX2+FMA desktop-class part, a ridge point around 10–17 FLOPs/byte is a reasonable intuition. The measured float AI here is 0.031. That is not near the ridge. It is nowhere close.

Metric	Value	Interpretation
Float arithmetic intensity	0.031 FLOPs/byte	Deeply memory-bound float path.
Integer arithmetic intensity	0.895 INTOPs/byte	Quantized integer work has much better byte efficiency.
Mixed arithmetic intensity	0.927 ops/byte (float + int)	Still far left of a typical AVX2 ridge point.
Total GFLOP/s	0.807	Tiny fraction of compute peak because bandwidth dominates.
Total GINTOP/s	23.04	Quantized integer throughput is doing the practical work.
DRAM bandwidth	32.6 GB/s	The active ceiling for decode.
L1 / L2 / L3 bandwidth	3886 / 1768 / 730 GB/s	On-die bandwidth is huge relative to DRAM, but decode does not stay resident there.
SP / DP FMA peak	559.3 / 253.3 GFLOP/s	Horizontal compute roofs.
Vectorized loop share	76.8%	Most runtime is already vectorized.
Vectorized loops / threads / ISA	5 / 12 / AVX2, AVX	This host is using AVX2 rather than AVX-512.

The live data tells the whole story in one sentence: float AI is 0.031 FLOPs/byte with DRAM bandwidth around 32.6 GB/s. That is a deeply memory-bound decode path. Quantized integer arithmetic helps by moving the workload toward 0.895–0.927 ops/byte, but even that is still far to the left of where compute peak would become the limiter.
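The same bandwidth figure also gives a back-of-envelope floor on decode latency. The sketch assumes every weight byte is fetched from DRAM exactly once per token, that the roughly 640 MB footprint is decimal megabytes, and that 32.6 GB/s is sustained for the whole token. None of these is guaranteed, and `bandwidth_floor_ms` is a hypothetical helper name.

```python
WEIGHT_BYTES = 640e6            # ~640 MB Q8_0 footprint (assumed decimal MB)
DRAM_BW_BYTES_S = 32.6e9        # DRAM roof reported for this host
MEASURED_MS_PER_TOKEN = 63.7    # decode latency from the run log


def bandwidth_floor_ms(weight_bytes: float, bw_bytes_s: float) -> float:
    """Minimum time per token if weights stream from DRAM once at full bandwidth."""
    return weight_bytes / bw_bytes_s * 1000.0


floor_ms = bandwidth_floor_ms(WEIGHT_BYTES, DRAM_BW_BYTES_S)
ceiling_tok_s = 1000.0 / floor_ms

print(f"bandwidth floor : {floor_ms:.1f} ms/token")
print(f"implied ceiling : {ceiling_tok_s:.1f} tok/s")
print(f"measured        : {MEASURED_MS_PER_TOKEN} ms/token "
      f"({floor_ms / MEASURED_MS_PER_TOKEN:.0%} of measured time is the floor)")
```

The estimate does not say where the remaining time goes. It only bounds how much faster decode could get on this host without shrinking the bytes moved per token.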

This is also why decode and prefill behave differently. Decode is GEMV-like: one token, one pass, each weight row touched once. Prefill is GEMM-like: multiple prompt tokens reuse the same weight rows, so arithmetic intensity rises and compute ceilings start to matter.

Roofline model in three equations

text

X-axis:
  Arithmetic Intensity = FLOPs / bytes from DRAM
Y-axis:
  Attainable performance = GFLOP/s
Ceilings:
  Memory roof   = BW * AI
  Compute roof  = Peak GFLOP/s
  Attainable    = min(BW * AI, Peak GFLOP/s)
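The three equations translate directly into code. This is a minimal sketch with hypothetical function names, evaluated with the peak and bandwidth values reported for this host.

```python
def ridge_point(bw_gb_s: float, peak_gflops: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the memory roof meets the compute roof."""
    return peak_gflops / bw_gb_s


def attainable_gflops(ai: float, bw_gb_s: float, peak_gflops: float) -> float:
    """Roofline bound: min(BW * AI, Peak)."""
    return min(bw_gb_s * ai, peak_gflops)


def bound_kind(ai: float, bw_gb_s: float, peak_gflops: float) -> str:
    """Classify a kernel by which roof limits it."""
    return "memory-bound" if ai < ridge_point(bw_gb_s, peak_gflops) else "compute-bound"


DRAM_BW = 32.6     # GB/s, from the Advisor summary
PEAK_SP = 559.3    # GFLOP/s, SP FMA peak from the Advisor summary

for ai in (0.031, 0.927, 17.2, 100.0):
    print(f"AI={ai:>7}: <= {attainable_gflops(ai, DRAM_BW, PEAK_SP):7.2f} GFLOP/s "
          f"({bound_kind(ai, DRAM_BW, PEAK_SP)})")
```

At the measured float AI of 0.031 the bound is about 1 GFLOP/s, the same order of magnitude as the 0.807 GFLOP/s Advisor reports.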

Decode versus prefill arithmetic intensity intuition

text

Decode (batch = 1):
  GEMV
  each weight row used once
  AI stays low
  usually memory-bound
Prefill (many prompt tokens):
  GEMM
  weight rows reused across tokens
  AI rises
  can become compute-bound

Theoretical bytes-per-weight framing for common dtypes

text

FP32 decode upper-bound intuition:
  2 FLOPs per weight / 4 bytes per weight = 0.5 FLOPs/byte
Q8_0 decode upper-bound intuition:
  2 FLOPs per weight / 1 byte per weight ≈ 2.0 FLOPs/byte
Measured float AI in this run:
  0.031 FLOPs/byte
Why measured < theoretical:
  extra metadata loads
  activations and outputs
  cache line waste
  thread synchronization
  imperfect reuse
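The same upper-bound framing extends from decode to prefill by letting several tokens reuse each weight. The model below is deliberately idealized: it counts only weight bytes, assumes perfect reuse across tokens, and ignores every effect listed above, so it is an optimistic bound rather than a prediction. The function names are hypothetical.

```python
import math


def ideal_ai(tokens: int, bytes_per_weight: float) -> float:
    """Upper-bound FLOPs/byte when `tokens` tokens share one pass over the weights."""
    return 2.0 * tokens / bytes_per_weight


def tokens_to_reach_ridge(ridge_ai: float, bytes_per_weight: float) -> int:
    """Smallest token count whose ideal AI meets or exceeds the ridge point."""
    return math.ceil(ridge_ai * bytes_per_weight / 2.0)


RIDGE_AI = 559.3 / 32.6   # peak FP32 / DRAM bandwidth for this host

print(f"decode, Q8_0 (1 B/weight): {ideal_ai(1, 1.0):.1f} FLOPs/byte")
print(f"decode, FP32 (4 B/weight): {ideal_ai(1, 4.0):.1f} FLOPs/byte")
print(f"tokens to reach ridge, Q8_0: {tokens_to_reach_ridge(RIDGE_AI, 1.0)}")
print(f"tokens to reach ridge, FP32: {tokens_to_reach_ridge(RIDGE_AI, 4.0)}")
```

Even in this best case a single decode token sits far below the ridge, while a modest prefill batch can cross it, which is the decode-versus-prefill split described above.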

Representative advisor_summary.json payload

json

{
  "summary_metrics": {
    "float_arithmetic_intensity": 0.031,
    "int_arithmetic_intensity": 0.895,
    "mixed_arithmetic_intensity": 0.927,
    "total_gflops": 0.807,
    "total_gintops": 23.04,
    "dram_bw_gb_s": 32.6,
    "l1_bw_gb_s": 3886.0,
    "l2_bw_gb_s": 1768.0,
    "l3_bw_gb_s": 730.0,
    "sp_fma_peak_gflops": 559.3,
    "dp_fma_peak_gflops": 253.3,
    "vectorized_loops_count": 5,
    "cpu_threads": 12,
    "isa_used": "AVX2, AVX"
  }
}

buildRooflineSvg() — chart setup and ridge point

javascript

function buildRooflineSvg(p) {
    const ai        = (p && Number.isFinite(Number(p.ai)))       ? Number(p.ai)       : null;
    const gflops    = (p && Number.isFinite(Number(p.gflops)))   ? Number(p.gflops)   : null;
    const dramBw    = (p && Number.isFinite(Number(p.dramBw)))   ? Number(p.dramBw)   : 29.0;
    const peakFp32  = (p && Number.isFinite(Number(p.peakFp32))) ? Number(p.peakFp32)
                    : (p && Number.isFinite(Number(p.peakSp)))   ? Number(p.peakSp)   : 576.0;
    const peakFp64  = (p && Number.isFinite(Number(p.peakFp64))) ? Number(p.peakFp64)
                    : (p && Number.isFinite(Number(p.peakDp)))   ? Number(p.peakDp)   : 288.0;
    const kernels   = (p && Array.isArray(p.kernels)) ? p.kernels : [];
    const fp32RidgeAI = peakFp32 / dramBw;
    const W = 680, H = 390;
    const PL = 68, PR = 648, PT = 32, PB = 310;
    const xLogMin = Math.log10(0.05),  xLogMax = Math.log10(200);
    const yLogMin = Math.log10(0.08),  yLogMax = Math.log10(2000);

buildRooflineSvg() — bandwidth roofs on log-log axes

javascript

const drawRoof = (bw, color, dash, label, tooltipDesc, labelDx = 5, labelDy = -5) => {
    const ridgeAI = peakFp32 / bw;
    const xStart  = Math.pow(10, xLogMin);
    const yStart  = bw * xStart;
    const x1 = toX(xStart), y1 = toY(Math.min(yStart, peakFp32));
    const x2 = toX(Math.min(ridgeAI, Math.pow(10, xLogMax)));
    const y2 = toY(Math.min(bw * Math.min(ridgeAI, Math.pow(10, xLogMax)), peakFp32));
    if (x2 > PL) {
        s += `<line x1="${Math.max(x1,PL).toFixed(1)}" y1="${Math.min(y1,PB).toFixed(1)}" x2="${Math.min(x2,PR).toFixed(1)}" y2="${Math.max(y2,PT).toFixed(1)}" stroke="${color}" stroke-width="1.8" stroke-dasharray="${dash}" opacity="0.85"/>`;
        const mx = (Math.max(x1,PL)+Math.min(x2,PR))/2;
        const my = (Math.min(y1,PB)+Math.max(y2,PT))/2;
        s += `<text x="${(mx+labelDx).toFixed(1)}" y="${(my+labelDy).toFixed(1)}" fill="${color}" font-size="8.5">${label}</text>`;
    }
};
drawRoof(dramBw, '#f39c12', 'none', `DRAM ${dramBw.toFixed(0)} GB/s`, 'DRAM roof');
drawRoof(200, '#3498db', '6,3', 'L2 ~200 GB/s', 'L2 roof');
drawRoof(800, '#9b59b6', '3,4', 'L1 ~800 GB/s', 'L1 roof');

buildRooflineSvg() — compute ceilings and ridge marker

javascript

const yFp32 = toY(peakFp32);
const xFp32Ridge = toX(fp32RidgeAI);
if (yFp32 >= PT && yFp32 <= PB) {
    s += `<line x1="${Math.min(xFp32Ridge,PR).toFixed(1)}" y1="${yFp32.toFixed(1)}" x2="${PR}" y2="${yFp32.toFixed(1)}" stroke="#47b475" stroke-width="1.8" opacity="0.95"/>`;
    s += `<text x="${PR-4}" y="${(yFp32-5).toFixed(1)}" text-anchor="end" fill="#47b475" font-size="8">FP32·FMA ${peakFp32.toFixed(0)} GF/s</text>`;
}
const dramRidgeAI = fp32RidgeAI;
const xRidge = toX(dramRidgeAI), yRidge = toY(peakFp32);
if (xRidge >= PL && xRidge <= PR) {
    s += `<line x1="${xRidge.toFixed(1)}" y1="${yRidge.toFixed(1)}" x2="${xRidge.toFixed(1)}" y2="${PB}" stroke="#f39c12" stroke-width="0.8" stroke-dasharray="2,4" opacity="0.5"/>`;
    s += `<circle cx="${xRidge.toFixed(1)}" cy="${yRidge.toFixed(1)}" r="4" fill="none" stroke="#f39c12" stroke-width="1.5"/>`;
}

buildRooflineSvg() — kernel-family dots and measured workload dot

javascript

kernels.forEach((k, i) => {
    if (!Number.isFinite(k.ai) || !Number.isFinite(k.gflops) || k.ai <= 0 || k.gflops <= 0) return;
    const kx = toX(k.ai), ky = toY(k.gflops);
    const labelOffset = 10 + ((i % 2) * 8);
    s += `<circle cx="${kx.toFixed(1)}" cy="${ky.toFixed(1)}" r="7" fill="${k.color||'#aaa'}" opacity="0.75"/>`;
    s += `<text x="${kx.toFixed(1)}" y="${(ky-labelOffset).toFixed(1)}" text-anchor="middle" fill="${k.color||'#aaa'}" font-size="7.4">${String(k.label || '')}</text>`;
});
if (ai !== null && gflops !== null && ai > 0 && gflops > 0) {
    const wx = toX(ai), wy = toY(gflops);
    s += `<circle cx="${wx.toFixed(1)}" cy="${wy.toFixed(1)}" r="10" fill="rgba(255,255,255,0.07)" stroke="rgba(255,255,255,0.3)"/>`;
    s += `<circle cx="${wx.toFixed(1)}" cy="${wy.toFixed(1)}" r="5" fill="#ffffff" stroke="rgba(0,0,0,0.6)" stroke-width="1.5"/>`;
    s += `<text x="${(wx+14).toFixed(1)}" y="${(wy-6).toFixed(1)}" fill="#ffffff" font-size="9" font-weight="700">Workload</text>`;
}

Ridge point math for the live run

text

Given:
  peak FP32 = 559.3 GFLOP/s
  DRAM BW   = 32.6 GB/s
Ridge point:
  ridge = peak / bandwidth
  ridge = 559.3 / 32.6
  ridge ≈ 17.16 FLOPs/byte
Measured float AI:
  0.031 FLOPs/byte
Measured AI as a fraction of the ridge:
  0.031 / 17.16 ≈ 0.18%

Roofline chart showing arithmetic intensity on a log-log axis with DRAM, L2, L1, and FP32 ceilings, placing Qwen3 decode far left in the memory-bound region.

Intel VTune — Microarchitecture Deep Dive

perf stat tells you the hardware symptoms. VTune tells you which code regions are causing those symptoms and, in many modes, how the machine spent its slots in Top-Down Microarchitecture Analysis Method terms: Retiring, Front-End Bound, Bad Speculation, and Back-End Bound, with the back end splitting further into memory-bound and core-bound behavior.

The hotspots list in this run is brutally concentrated. vec_dot_q8_0_q8_0_avx alone takes 14.73 seconds of CPU time, easily more than three quarters of the total hotspot budget. That immediately answers the optimization question: if you want the whole model faster, this is the one function you attack first.

This is Amdahl’s Law in its most useful operational form. If 75% of runtime sits in one kernel, a 2× improvement there lifts total speed by roughly 1.6×. If flash attention accounts for only about 0.1% of runtime, as the 0.02 s entry in the hotspot table below indicates, heroic attention tuning will barely move end-to-end decode. VTune is not just a microscope. It is a priority engine.

Symbol	Time	Percentage
`vec_dot_q8_0_q8_0_avx`	14.73s	>75% of hotspot time
`worker_main`	2.45s	Thread-pool worker envelope
`gemm_nt_q8_0_q8_0_avx2`	0.64s	Prefill GEMM contribution
`__memset_avx2_unaligned_erms`	0.50s	Memory zeroing overhead
`ck_threadpool_dispatch`	0.22s	Dispatch overhead
`gemv_q8_0_q8_0_parallel_simd`	0.22s	Parallel GEMV envelope
`attention_flash_decode`	0.02s	<0.2%, already very fast
`quantize_row_q8_0`	0.02s	Activation quantization is minor
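The percentages can be reproduced from the times in the table. The sketch assumes the eight listed rows make up the whole hotspot budget. A real VTune report may contain more rows, so treat the shares as upper estimates.

```python
# CPU time in seconds, copied from the hotspot table.
hotspots = {
    "vec_dot_q8_0_q8_0_avx": 14.73,
    "worker_main": 2.45,
    "gemm_nt_q8_0_q8_0_avx2": 0.64,
    "__memset_avx2_unaligned_erms": 0.50,
    "ck_threadpool_dispatch": 0.22,
    "gemv_q8_0_q8_0_parallel_simd": 0.22,
    "attention_flash_decode": 0.02,
    "quantize_row_q8_0": 0.02,
}

total_s = sum(hotspots.values())  # assumption: listed rows are the full budget
shares = {symbol: seconds / total_s for symbol, seconds in hotspots.items()}

print(f"total listed CPU time: {total_s:.2f} s")
for symbol, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{symbol:32s} {share:7.2%}")
```

Under that assumption the dot kernel accounts for a little over 78% of listed time and flash attention for about 0.1%.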

VTune hotspot CSV parser from vtune_artifacts_v7.py

python

def parse_hotspots_csv(path: Path, top_k: int = 25) -> List[Dict[str, object]]:
    if not path.exists():
        return []
    text = path.read_text(errors="ignore")
    if not text.strip():
        return []
    rows: List[Dict[str, object]] = []
    def _collect(reader: csv.DictReader) -> None:
        for row in reader:
            symbol = pick_text(row, ["Function", "Function/Call Stack", "Call Stack", "Source Function", "Module"])
            if not symbol:
                continue
            value = pick_value(row, ["CPU Time", "CPU Time:Self", "CPU Time:Total", "Effective Time", "Elapsed Time"])
            percent = pick_value(row, ["CPU Time:Self %", "CPU Time %", "Effective Time %", "Elapsed Time %"])
            rows.append({"symbol": symbol, "value": value if value is not None else 0.0, "percent": percent if percent is not None else 0.0})

VTune collection commands from ck_cli_v7.c

c

snprintf(cmd, sizeof(cmd), "vtune -collect hotspots -result-dir '%s' -quiet -- %s", vt_hot, base_train);
rc = run_shell_cmd(cmd);
if (rc != 0) return rc;
snprintf(cmd, sizeof(cmd), "vtune -report hotspots -result-dir '%s' -format text -report-output '%s' >/dev/null 2>&1", vt_hot, vt_hot_txt);
run_shell_cmd(cmd);
snprintf(cmd, sizeof(cmd), "vtune -collect memory-access -result-dir '%s' -quiet -- %s", vt_mem, base_train);
int mem_rc = run_shell_cmd(cmd);

Structure of the emitted vtune_summary.json

python

payload: Dict[str, object] = {
    "generated_at": utc_now_iso(),
    "analysis": "hotspots",
    "result_dir": primary.get("result_dir"),
    "report_path": primary.get("report_text"),
    "csv_path": primary.get("report_csv"),
    "top_hotspots": primary.get("top_hotspots", []),
    "hotspots": primary.get("hotspots", []),
    "raw_text": primary.get("raw_text", ""),
    "analyses": analyses,
    "analysis_metrics": {
        str(entry.get("name") or f"analysis_{i}"): entry.get("summary_metrics", {})
        for i, entry in enumerate(analyses)
    },
    "artifacts": artifacts,
}

Amdahl’s Law on the dominant VTune hotspot

text

If vec_dot_q8_0_q8_0_avx is 75% of total time:
Speedup_total = 1 / ((1 - p) + p / s)
With p = 0.75 and s = 2.0:
  Speedup_total = 1 / (0.25 + 0.75 / 2)
  Speedup_total = 1 / 0.625
  Speedup_total = 1.6x
Meaning:
  optimize the dominant dot kernel first
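The formula is easy to turn into a small prioritization helper. This is a sketch with a hypothetical function name. The 75% case uses the numbers above, and the flash-attention case assumes a share of about 0.1%, which is the 0.02 s hotspot entry against roughly 18.8 s of listed CPU time.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of runtime is accelerated by factor s."""
    if not 0.0 <= p <= 1.0 or s <= 0.0:
        raise ValueError("need 0 <= p <= 1 and s > 0")
    return 1.0 / ((1.0 - p) + p / s)


INF = float("inf")  # models making a region cost nothing at all

print(f"dot kernel 2x faster     : {amdahl_speedup(0.75, 2.0):.3f}x")
print(f"dot kernel infinitely fast: {amdahl_speedup(0.75, INF):.3f}x")
print(f"attention infinitely fast : {amdahl_speedup(0.001, INF):.3f}x")
```

Removing flash attention entirely would buy about 0.1%, while merely doubling the speed of the dot kernel buys 60%.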

VTune hotspot breakdown emphasizing vec_dot_q8_0_q8_0_avx as the dominant CPU time sink and showing the thread-pool envelope around it.

Intel Advisor — Roofline with Hardware Measurements

VTune and Advisor overlap, but they are not the same tool. VTune is strongest when you want hotspot ranking and microarchitectural decomposition. Advisor is strongest when you want measured memory traffic and measured arithmetic intensity that place the workload on the roofline with fewer assumptions.

That distinction matters because timing-based arithmetic intensity estimates can be directionally correct but imprecise. Advisor’s roofline collection gives you structured measurements for float AI, integer AI, mixed AI, bandwidth ceilings, vectorization coverage, and loop counts. CKE stores the HTML report, CSV export, and summary JSON so the run can be read later without rerunning the tool.

The vectorization result is especially telling. Advisor sees 5 vectorized loops covering 76.8% of runtime and reports ISA usage as “AVX2, AVX.” So the system is not slow because it forgot to vectorize. It is slow because the vectorized work is waiting on bytes. Advisor moves the conversation from “did we SIMD this?” to “did SIMD actually move us off the memory roof?”

Advisor XML-to-summary key map in advisor_artifacts_v7.py

python

key_map = {
    "ProgramTime": "elapsed_time_s",
    "ElapsedTime": "elapsed_time_s",
    "TotalGFLOPS": "total_gflops",
    "TotalGFLOPCount": "total_gflop_count",
    "TotalGINTOPS": "total_gintops",
    "TotalGINTOPCount": "total_gintop_count",
    "TotalGMixedOPS": "total_gmixed_ops",
    "TotalGMixedOPCount": "total_gmixed_op_count",
    "TotalFloatAI": "float_arithmetic_intensity",
    "TotalIntAI": "int_arithmetic_intensity",
    "TotalMixedAI": "mixed_arithmetic_intensity",
    "TotalCPUTime": "total_cpu_time_s",
    "TimeInVectorizedLoops": "time_in_vectorized_loops_s",
    "TimeInScalarLoops": "time_in_scalar_loops_s",
    "TimeOutsideOfAnyLoop": "time_outside_loops_s",
    "VectorizedLoopsCount": "vectorized_loops_count",
    "CPUThreads": "cpu_threads",
}

Advisor roof items folded into flat summary metrics

python

roof = enrichment.get("advisum_metrics", {}).get("roof_items", [])
for item in roof:
    name = item.get("name", "")
    bw = item.get("bandwidth_gops")
    if not name or bw is None:
        continue
    if name == "DRAM Bandwidth":
        flat["dram_bw_gb_s"] = round(bw, 3)
    elif name == "SP Vector FMA Peak":
        flat["sp_fma_peak_gflops"] = round(bw, 3)
    elif name == "DP Vector FMA Peak":
        flat["dp_fma_peak_gflops"] = round(bw, 3)
    elif name == "L1 Bandwidth":
        flat["l1_bw_gb_s"] = round(bw, 3)
    elif name == "L2 Bandwidth":
        flat["l2_bw_gb_s"] = round(bw, 3)
    elif name == "L3 Bandwidth":
        flat["l3_bw_gb_s"] = round(bw, 3)

Advisor collection pipeline in ck_cli_v7.c

c

snprintf(cmd, sizeof(cmd), "advisor --collect=roofline --project-dir '%s' -- %s", adv_dir, base_train);
int collect_rc = run_shell_cmd(cmd);
if (collect_rc == 0) {
    snprintf(cmd, sizeof(cmd), "advisor --report=roofline --project-dir '%s' --format=text --report-output '%s' >/dev/null 2>&1", adv_dir, adv_txt);
    run_shell_cmd(cmd);
    snprintf(cmd, sizeof(cmd), "advisor --report=roofline --project-dir '%s' --format=csv --report-output '%s' >/dev/null 2>&1", adv_dir, adv_csv);
    run_shell_cmd(cmd);
    snprintf(cmd, sizeof(cmd), "advisor --report=roofline --project-dir '%s' --report-output '%s' >/dev/null 2>&1", adv_dir, adv_html);
    run_shell_cmd(cmd);
}

Representative Advisor summary fields used by the visualizer

json

{
  "analysis": "roofline",
  "summary_metrics": {
    "float_arithmetic_intensity": 0.031,
    "int_arithmetic_intensity": 0.895,
    "mixed_arithmetic_intensity": 0.927,
    "dram_bw_gb_s": 32.6,
    "l1_bw_gb_s": 3886.0,
    "l2_bw_gb_s": 1768.0,
    "l3_bw_gb_s": 730.0,
    "sp_fma_peak_gflops": 559.3,
    "dp_fma_peak_gflops": 253.3,
    "vectorized_loops_count": 5,
    "cpu_threads": 12,
    "isa_used": "AVX2, AVX"
  },
  "html_path": "advisor_roofline.html"
}
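
To make the summary fields concrete, here is a minimal Python sketch of the standard roofline arithmetic applied to the representative numbers above. The helper names `ridge_point` and `attainable_gflops` are illustrative and not part of CKE. The sketch assumes the SP vector FMA peak and the DRAM roof are the two relevant ceilings, and it treats the mixed arithmetic intensity loosely as operations per byte.

```python
# Illustrative roofline arithmetic; helper names are assumptions, not CKE APIs.

def ridge_point(peak_gflops: float, bw_gb_s: float) -> float:
    """Arithmetic intensity (ops/byte) where the bandwidth roof meets the compute roof."""
    return peak_gflops / bw_gb_s


def attainable_gflops(ai: float, peak_gflops: float, bw_gb_s: float) -> float:
    """Roofline bound: min(compute roof, arithmetic intensity * bandwidth roof)."""
    return min(peak_gflops, ai * bw_gb_s)


# Values copied from the representative summary_metrics payload.
summary = {
    "mixed_arithmetic_intensity": 0.927,
    "dram_bw_gb_s": 32.6,
    "sp_fma_peak_gflops": 559.3,
}

ridge = ridge_point(summary["sp_fma_peak_gflops"], summary["dram_bw_gb_s"])
bound = attainable_gflops(
    summary["mixed_arithmetic_intensity"],
    summary["sp_fma_peak_gflops"],
    summary["dram_bw_gb_s"],
)
print(f"ridge point: {ridge:.1f} ops/byte")
print(f"DRAM-roof bound at AI=0.927: {bound:.1f} Gops/s")
```

With these inputs the ridge sits near 17.2 ops/byte and the DRAM roof caps throughput near 30.2 Gops/s, far below the 559.3 GFLOP/s vector peak. The operating point is well to the left of the ridge.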

Flamegraphs — Where Time Actually Goes

Flamegraphs, introduced by Brendan Gregg in 2011, are a visual compression of stack-sampling data. The usual pipeline is simple: perf record captures stacks, perf script expands them, a stack-collapser folds repeated stacks, and a renderer turns the folded counts into an SVG. The width of each box is what matters: wider means more time. Height is call-stack depth. Color is largely decorative.

Flamegraphs and VTune tell the same story from different angles. VTune says the dominant symbol is the quantized dot kernel. The flamegraph says the dominant symbol is gemv_q8_0_q8_0_parallel_simd, and it also shows the call-stack context: that work sits under worker_main and ck_threadpool_dispatch.

That stack context is the big advantage. Hotspot tables answer “what function is expensive?” Flamegraphs add “who called it, and through what runtime envelope?” In this run, flamegraphs reinforce the same constraint chain as VTune: GEMV dominates, thread-pool orchestration is visible, flash attention is tiny.

Flamegraph generation path in the v7 Makefile

bash

perf record --all-user -F 999 --call-graph dwarf -o "$fg_perf_data" -- \
    ./build/ck-cli-v7 "$runtime_dir/libmodel.so" "$runtime_dir/weights.bump" \
    --prompt "$fg_prompt" --max-tokens "$fg_max_tokens" --timing --quiet-output
perf script -i "$fg_perf_data" | \
    ./FlameGraph/stackcollapse-perf.pl | \
    tee "$fg_perf_folded" | \
    ./FlameGraph/flamegraph.pl --title="$fg_title" > "$fg_svg"

Folded-stack top-symbol parser in perf_artifacts_v7.py

python

def parse_folded_top_symbols(folded_path: Path, top_k: int = 25) -> List[Dict[str, object]]:
    samples: Dict[str, int] = {}
    for line in folded_path.read_text(errors="ignore").splitlines():
        line = line.strip()
        if not line:
            continue
        stack, count_s = line.rsplit(" ", 1)
        count = int(count_s)
        if ";" in stack:
            symbol = stack.split(";")[-1]
        else:
            symbol = stack
        samples[symbol] = samples.get(symbol, 0) + count
    ranked = sorted(samples.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return [{"symbol": sym, "samples": cnt} for sym, cnt in ranked]

Representative flamegraph_manifest.json top symbols

json

{
  "mode": "decode",
  "top_symbols": [
    {"symbol": "gemv_q8_0_q8_0_parallel_simd", "samples": 4530000000},
    {"symbol": "worker_main", "samples": 1450000000},
    {"symbol": "gemm_nt_q8_0_q8_0", "samples": 828000000},
    {"symbol": "ck_threadpool_dispatch", "samples": 249000000},
    {"symbol": "__memcpy_avx_unaligned_erms", "samples": 156000000},
    {"symbol": "attention_flash_decode", "samples": 28000000},
    {"symbol": "quantize_row_q8_0", "samples": 7600000},
    {"symbol": "swiglu_forward", "samples": 5700000}
  ]
}

Mini folded-stack example that explains flamegraph width

text

worker_main;ck_threadpool_dispatch;gemv_q8_0_q8_0_parallel_simd 4530000000
worker_main;ck_threadpool_dispatch;gemm_nt_q8_0_q8_0 828000000
worker_main;ck_threadpool_dispatch;attention_flash_decode 28000000
Interpretation:
  first stack is massively wider than the others
  the leaf box owns most samples
  its parents also appear wide because they sit beneath it
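
The width rule in the mini example can be checked numerically. The sketch below is a hedged illustration rather than CKE code: `inclusive_fractions` is a hypothetical helper that computes, for each frame, the fraction of all samples whose stack contains that frame. That fraction is what a flamegraph draws as box width.

```python
from collections import Counter
from typing import Dict, List


def inclusive_fractions(folded_lines: List[str]) -> Dict[str, float]:
    """Fraction of all samples in which each frame appears anywhere on the stack."""
    inclusive: Counter = Counter()
    total = 0
    for line in folded_lines:
        stack, count_s = line.rsplit(" ", 1)
        count = int(count_s)
        total += count
        # set() so a recursive frame is counted once per stack
        for frame in set(stack.split(";")):
            inclusive[frame] += count
    if total == 0:
        return {}
    return {frame: n / total for frame, n in inclusive.items()}


# The three folded stacks from the mini example.
folded = [
    "worker_main;ck_threadpool_dispatch;gemv_q8_0_q8_0_parallel_simd 4530000000",
    "worker_main;ck_threadpool_dispatch;gemm_nt_q8_0_q8_0 828000000",
    "worker_main;ck_threadpool_dispatch;attention_flash_decode 28000000",
]
widths = inclusive_fractions(folded)
for frame, frac in sorted(widths.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{frame:32s} {frac:6.1%}")
```

In this three-stack example the two parent frames span the full width, the GEMV leaf takes roughly 84% of it, and flash attention is about half a percent.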

Flamegraph anatomy highlighting wide top frames for GEMV, the worker_main thread-pool envelope, and the tiny footprint of flash attention.

Per-Operation Profiling — The Op-Level Breakdown

The external profilers tell you about hardware and symbols. CKE’s internal CK_PROFILE instrumentation tells you about model semantics. Every operator call can emit a CSV row with mode, token id, layer, op name, and time in microseconds.

That is how the report can say something stronger than “the matmul is hot.” It can say mlp_gate_up consumed 104.3 ms, logits consumed 79.1 ms, and attn consumed only 9.8 ms in the token-0 summary view. The surprise for many readers is that attention is only 2.6% of decode time here. The dominant cost is the MLP and final vocabulary projection.

That is exactly what the matrix shapes predict. The gate-up projection is the widest MLP expansion, and the logits projection hits a 151,936-token vocabulary. Those are giant GEMV-style reads, so they inherit the same bandwidth wall the roofline chart already warned about.

476 measurements: that is 28 layers × 17 ops, enough resolution to see exactly which operator family is burning decode time.

Operation	Time (µs)	Share of total	Category
mlp_gate_up	104300	27.4%	MLP projection
logits	79100	20.8%	Final vocab projection
mlp_down	53200	14.0%	MLP projection
out_proj	38500	10.1%	Attention output projection
q_proj	35200	9.2%	Attention projection
v_proj	21700	5.7%	Attention projection
k_proj	20100	5.3%	Attention projection
attn	9800	2.6%	Flash attention core
residual_add	6700	1.8%	Elementwise

build_summary() from generate_profile_summary_v7.py

python

def build_summary(entries: List[Dict[str, str]]) -> Dict[str, object]:
    by_op: Dict[str, float] = {}
    by_layer: Dict[int, Dict[str, float]] = {}
    total_us = 0.0
    for e in entries:
        if e.get("token_id", "0") != "0":
            continue
        op = e.get("op", "unknown")
        layer = int(e.get("layer", -1))
        us = float(e.get("time_us", 0))
        total_us += us
        by_op[op] = by_op.get(op, 0.0) + us
        if layer >= 0:
            if layer not in by_layer:
                by_layer[layer] = {}
            by_layer[layer][op] = by_layer[layer].get(op, 0.0) + us

Representative rows from profile_decode.csv

csv

mode,token_id,layer,op,time_us
prefill,0,-1,embed,4200
prefill,0,0,q_proj,1180
prefill,0,0,k_proj,710
prefill,0,0,v_proj,760
decode,0,0,q_proj,1330
decode,0,0,k_proj,740
decode,0,0,v_proj,810
decode,0,0,attn,360
decode,0,0,mlp_gate_up,3950
decode,0,0,mlp_down,1880
decode,0,-1,logits,79100

Representative profile_summary.json by-op payload

json

{
  "total_us": 380890.0,
  "total_ms": 380.89,
  "by_op": {
    "mlp_gate_up": 104300.0,
    "logits": 79100.0,
    "mlp_down": 53200.0,
    "out_proj": 38500.0,
    "q_proj": 35200.0,
    "v_proj": 21700.0,
    "k_proj": 20100.0,
    "attn": 9800.0,
    "residual_add": 6700.0
  }
}
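
The percentage column follows directly from this payload. Here is a small sketch of that arithmetic, assuming shares are computed against `total_us` rather than against the sum of the listed ops. `op_shares` is an illustrative helper, not a function from generate_profile_summary_v7.py.

```python
from typing import Dict


def op_shares(by_op: Dict[str, float], total_us: float) -> Dict[str, float]:
    """Percentage of total_us spent in each op, rounded to one decimal place."""
    if total_us <= 0:
        return {}
    return {op: round(100.0 * us / total_us, 1) for op, us in by_op.items()}


# Values copied from the representative profile_summary.json payload.
summary = {
    "total_us": 380890.0,
    "by_op": {
        "mlp_gate_up": 104300.0,
        "logits": 79100.0,
        "mlp_down": 53200.0,
        "out_proj": 38500.0,
        "q_proj": 35200.0,
        "v_proj": 21700.0,
        "k_proj": 20100.0,
        "attn": 9800.0,
        "residual_add": 6700.0,
    },
}

shares = op_shares(summary["by_op"], summary["total_us"])
listed_us = sum(summary["by_op"].values())
unlisted_pct = round(100.0 * (summary["total_us"] - listed_us) / summary["total_us"], 1)
for op, pct in shares.items():
    print(f"{op:14s} {pct:5.1f}%")
print(f"ops not listed {unlisted_pct:5.1f}%")
```

The nine listed ops sum to 368,600 µs, so about 3.2% of the 380,890 µs total belongs to ops that the representative payload does not show.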

Full by-mode split kept in the same summary object

python

by_mode: Dict[str, Dict[str, object]] = {}
for e in entries:
    mode = e.get("mode", "unknown")
    op = e.get("op", "unknown")
    us = float(e.get("time_us", 0))
    bucket = by_mode.setdefault(mode, {"total_us": 0.0, "by_op": {}})
    bucket["total_us"] = float(bucket["total_us"]) + us
    op_map = bucket["by_op"]
    op_map[op] = op_map.get(op, 0.0) + us

The Heatmap — Visualizing Time Across Layers

Once you have hundreds of op measurements, tables stop being intuitive. That is why the IR Visualizer renders layer-by-op heatmaps. The layout is natural for a transformer: header work at the top, 28 body layers in the middle, and footer work like final norm and logits at the end.

Color intensity is just time concentration. Dark cells are cheap. Bright cells are expensive. Layer 0 is often a bit hotter because the cache is cold, and the footer often dominates because the vocabulary projection is the single largest matrix-vector multiply in the graph.

The heatmap is not merely decorative. It quickly answers questions like “is one layer pathological?”, “is the footer dominating?”, and “does the cost pattern change between prefill and decode?” Section × op mode is especially useful because it aggregates across layers and shows which operation family dominates each region of the network.

Heatmap cell accumulation logic in renderProfileHeatmap()

javascript

function renderProfileHeatmap(data) {
    const entries = Array.isArray(data?.entries) ? data.entries : [];
    const heatMode = document.getElementById('profileHeatmapMode')?.value || 'layer_op';
    const rowScope = document.getElementById('profileHeatmapRows')?.value || 'layers';
    const cellValues = new Map();
    const rowTotals = new Map();
    const colTotals = new Map();
    const addCell = (rowKey, colKey, us) => {
        const key = `${rowKey}||${colKey}`;
        cellValues.set(key, (cellValues.get(key) || 0) + us);
        rowTotals.set(rowKey, (rowTotals.get(rowKey) || 0) + us);
        colTotals.set(colKey, (colTotals.get(colKey) || 0) + us);
    };

Heatmap row and section normalization

javascript

entries.forEach((entry) => {
    const us = safeNum(entry?.time_us, 0);
    const op = String(entry?.op || 'unknown');
    const layer = parseInt(entry?.layer ?? -1, 10);
    const meta = getProfileEntryMeta(entry, metaIndex);
    const section = normalizeSectionName(meta?.section, op, layer);
    if (heatMode === 'section_op') {
        addCell(sectionLabel[section] || section, op, us);
    } else if (heatMode === 'layer_section') {
        let layerLabel = `L${layer}`;
        if (layer < 0) {
            if (section === 'header') layerLabel = 'Header';
            else if (section === 'footer') layerLabel = 'Footer';
            else layerLabel = 'Global';
        }
        addCell(layerLabel, sectionLabel[section] || section, us);
    }
});

Heatmap rendering and visible-scope normalization

javascript

let maxUs = 0;
sortedRows.forEach((rowKey) => {
    sortedCols.forEach((colKey) => {
        const v = cellValues.get(`${rowKey}||${colKey}`) || 0;
        if (v > maxUs) maxUs = v;
    });
});
if (maxUs <= 0) maxUs = 1;
html += '<div style="overflow-x:auto;"><table style="font-size:0.75rem;border-collapse:collapse;">';
html += '<thead><tr><th style="padding:4px 8px;">Row</th>';
sortedRows.forEach((rowKey) => {
    sortedCols.forEach((colKey) => {
        const us = cellValues.get(`${rowKey}||${colKey}`) || 0;
        const intensity = us / maxUs;
        const alpha = us <= 0 ? 0 : Math.min(1, 0.18 + 0.82 * Math.pow(intensity, 0.45));
        const bg = `rgba(255,180,0,${alpha.toFixed(3)})`;
        html += `<td style="background:${bg};text-align:center;">${us > 0 ? (us / 1000).toFixed(1) : ''}</td>`;
    });
});
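
The alpha expression in the excerpt is worth a second look, because the 0.45 exponent lifts small values so that cheap cells stay visible next to one dominant cell. The following Python mirror of that single expression is only a sketch for exploring the curve. `cell_alpha` is an illustrative name and the visualizer itself does this in JavaScript.

```python
def cell_alpha(us: float, max_us: float) -> float:
    """Python mirror of the alpha expression used for heatmap cell backgrounds."""
    if us <= 0:
        return 0.0
    intensity = us / max_us
    # 0.18 floor keeps any non-zero cell visible; the 0.45 power compresses the top end.
    return min(1.0, 0.18 + 0.82 * intensity ** 0.45)


for frac in (0.01, 0.1, 0.5, 1.0):
    print(f"{frac:>5.0%} of max -> alpha {cell_alpha(frac, 1.0):.3f}")
```

A cell at 1% of the hottest cell still gets an alpha near 0.28, which is why attention cells remain readable even when logits dominates the scale.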

Layer-by-operation heatmap showing bright MLP and logits regions, cooler attention cells, and a slightly hotter first layer due to cold-cache startup.

Theory of Constraints — The Bottleneck X-Ray

Eliyahu Goldratt’s Theory of Constraints says a system is only as fast as its tightest constraint. That sounds abstract until you apply it to CPU inference. Then it becomes brutally practical: if arithmetic intensity is far below the ridge point, the real bottleneck is memory bandwidth, not imagination, not hype, and not theoretical peak FLOPs.

CKE encodes that logic directly into the Profile tab. When Advisor data exists, it classifies the run as MEMORY-BOUND, BALANCED, or COMPUTE-BOUND using arithmetic intensity relative to the ridge point. When Advisor is missing, it falls back to IPC as a rough estimate.

For this Qwen3 decode run the conclusion is unambiguous. The constraint chain starts with DRAM bandwidth utilization, then compute throughput, then cache efficiency, then GEMM/GEMV time share, then branch prediction. Branch prediction ends up green. DRAM and cache behavior do not. This is why the IR report feels like an X-ray. It is not just plotting counters; it is ranking the weak links from weakest to strongest.

Constraint classification logic in renderProfileTOC()

javascript

const ai = physics.computeIntensity;
const ridgePoint = 10.0;
if (ai !== null && ai < ridgePoint * 0.6) {
    constraint = 'memory-bound';
    constraintTitle = 'MEMORY-BOUND';
    constraintExplain = `Arithmetic Intensity = ${ai.toFixed(3)} FLOPs/byte — well below the ridge point (~${ridgePoint} FLOPs/byte). The CPU's compute units are starved waiting for data.`;
    constraintAction = 'Reduce memory traffic: better quantization, tighter cache blocking, prefetch hints, weight layout optimization.';
} else if (ai !== null && ai >= ridgePoint * 0.6 && ai < ridgePoint * 1.4) {
    constraint = 'balanced';
} else if (ai !== null && ai >= ridgePoint * 1.4) {
    constraint = 'compute-bound';
} else if (ipcVal !== null) {
    if (ipcVal < 1.0) constraint = 'memory-bound';
}
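
The same thresholds are easy to exercise outside the browser. This is a hedged Python restatement of the branch structure shown in the excerpt, using its 0.6× and 1.4× bands and its IPC fallback. The function name and the "unknown" return value are assumptions, since the excerpt does not show what happens when neither condition applies.

```python
from typing import Optional


def classify_constraint(
    ai: Optional[float],
    ipc: Optional[float] = None,
    ridge_point: float = 10.0,
) -> str:
    """Classify a run using the 0.6x / 1.4x ridge-point bands from the excerpt."""
    if ai is not None:
        if ai < ridge_point * 0.6:
            return "memory-bound"
        if ai < ridge_point * 1.4:
            return "balanced"
        return "compute-bound"
    # Advisor data missing: fall back to IPC as a rough proxy.
    if ipc is not None and ipc < 1.0:
        return "memory-bound"
    # Not shown in the excerpt; "unknown" is an assumption of this sketch.
    return "unknown"


print(classify_constraint(0.927))          # representative mixed arithmetic intensity
print(classify_constraint(None, ipc=0.8))  # no Advisor data, low IPC
```

Note that the IPC fallback can only conclude "memory-bound". A high IPC without Advisor data leaves the classification open, which matches the "rough estimate" framing.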

Constraint chain link: DRAM bandwidth and compute throughput

javascript

if (physics.memoryBwGBs !== null) {
    const dramPeak = 29.0;
    const utilPct = (physics.memoryBwGBs / dramPeak) * 100;
    chain.push({
        label: 'DRAM Bandwidth',
        value: `${physics.memoryBwGBs.toFixed(1)} GB/s`,
        capacity: `${dramPeak.toFixed(0)} GB/s peak`,
        utilization: utilPct,
        insight: utilPct > 70
            ? 'Near DRAM saturation — this is the wall.'
            : 'Moderate DRAM pressure.'
    });
}
if (physics.gflops !== null) {
    const estPeak = 288.1;
    const utilPct = (physics.gflops / estPeak) * 100;
    chain.push({
        label: 'Compute Throughput',
        value: `${physics.gflops.toFixed(2)} GFLOP/s`,
        capacity: `~${estPeak.toFixed(0)} GFLOP/s peak (DP FMA)`,
        utilization: utilPct,
        insight: utilPct > 60 ? 'Good compute utilization' : 'Low compute utilization — data starvation or scalar code is wasting SIMD width.'
    });
}

Constraint chain link: cache efficiency, GEMM share, branch prediction

javascript

if (cmrVal !== null) {
    const missRate = cmrVal * 100;
    chain.push({
        label: 'Cache Efficiency',
        value: `${missRate.toFixed(2)}% miss rate`,
        capacity: '<1% ideal for streaming workloads',
        insight: missRate < 2 ? 'Excellent cache hit rate' : 'High cache miss rate — this is a significant bottleneck.'
    });
}
if (gemmPct > 0) {
    chain.push({
        label: 'GEMM/GEMV Time Share',
        value: `${gemmPct.toFixed(1)}%`,
        capacity: '>85% ideal (matmul should dominate)',
        insight: gemmPct > 80 ? 'Matmul dominates — overhead is minimal.' : 'Low GEMM share — non-matmul overhead is the constraint.'
    });
}
if (bmr !== undefined) {
    const missRate = Number(bmr) * 100;
    chain.push({
        label: 'Branch Prediction',
        value: `${missRate.toFixed(2)}% miss rate`,
        capacity: '<0.5% ideal',
        insight: missRate < 1 ? 'Excellent branch behavior.' : 'Too many mispredicts.'
    });
}

Constraint summary for the live Qwen3 decode case

text

Classification:
  MEMORY-BOUND
Primary chain, weakest first:
  1. DRAM Bandwidth utilization
  2. Compute Throughput utilization
  3. Cache Efficiency
  4. GEMM/GEMV Time Share
  5. Branch Prediction efficiency
Narrative:
  AI is far below ridge point
  cache miss rate is 63.9%
  branch behavior is excellent
  the wall is DRAM bandwidth

The Perf Gate — Automated Budget Enforcement

Performance engineering is not finished when you understand a bottleneck once. It is finished when regressions are stopped automatically. That is the role of the v7 perf gate: turn expected decode throughput and hardware-health floors into merge-blocking budgets.

For the model families discussed here, the defaults are straightforward. min_decode_tok_s is 8.0, min_ipc is 0.6, max_cache_miss_rate is 0.25, and max_branch_miss_rate is 0.08. If a change drops decode throughput to 6 tok/s or pushes cache misses well past the expected band, the CI gate fails.

This is what makes performance a contract instead of a dashboard screenshot. Budgets can also be overridden per family with environment variables such as CK_V7_PERF_QWEN3_MIN_DECODE_TOK_S. That lets teams tighten or relax expectations without rewriting the evaluator. CI is where performance intent becomes institutional memory.

Metric	Budget	Actual	Status
Decode throughput	≥ 8.0 tok/s	15.7 tok/s	PASS
IPC	≥ 0.6	1.42	PASS
Cache miss rate	≤ 25%	63.9%	FAIL if strict budget is applied
Branch miss rate	≤ 8%	0.52%	PASS
Family	qwen3	qwen3	Matched

resolve_budgets() from perf_gate_v7.py

python

def resolve_budgets(family: str) -> Dict[str, float]:
    base = {
        "min_decode_tok_s": 8.0,
        "min_ipc": 0.6,
        "max_cache_miss_rate": 0.25,
        "max_branch_miss_rate": 0.08,
    }
    family_defaults = {
        "qwen2": {"min_decode_tok_s": 8.0},
        "qwen3": {"min_decode_tok_s": 8.0},
        "gemma": {"min_decode_tok_s": 8.0},
    }
    if family in family_defaults:
        base.update(family_defaults[family])

Environment-variable overrides inside resolve_budgets()

python

overrides = {
    "min_decode_tok_s": parse_env_float(
        f"CK_V7_PERF_{env_family}_MIN_DECODE_TOK_S",
        "CK_V7_PERF_MIN_DECODE_TOK_S",
    ),
    "min_ipc": parse_env_float(
        f"CK_V7_PERF_{env_family}_MIN_IPC",
        "CK_V7_PERF_MIN_IPC",
    ),
    "max_cache_miss_rate": parse_env_float(
        f"CK_V7_PERF_{env_family}_MAX_CACHE_MISS_RATE",
        "CK_V7_PERF_MAX_CACHE_MISS_RATE",
    ),
    "max_branch_miss_rate": parse_env_float(
        f"CK_V7_PERF_{env_family}_MAX_BRANCH_MISS_RATE",
        "CK_V7_PERF_MAX_BRANCH_MISS_RATE",
    ),
}
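
The body of `parse_env_float` is not shown in the excerpt. As an illustration of the lookup order its call sites suggest (family-specific variable first, then the global one), here is a minimal sketch. `first_env_float` and `apply_overrides` are hypothetical stand-ins, and the precedence is an assumption inferred from the argument order.

```python
import os
from typing import Dict, Mapping, Optional


def first_env_float(*names: str, env: Mapping[str, str] = os.environ) -> Optional[float]:
    """Return the first listed variable that is set and parses as a float."""
    for name in names:
        raw = env.get(name)
        if raw is None or not raw.strip():
            continue
        try:
            return float(raw)
        except ValueError:
            continue  # ignore malformed values and try the next name
    return None


def apply_overrides(base: Dict[str, float], overrides: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Overlay only the overrides that were actually set."""
    merged = dict(base)
    merged.update({k: v for k, v in overrides.items() if v is not None})
    return merged


# A plain dict stands in for the process environment so the example is deterministic.
fake_env = {
    "CK_V7_PERF_QWEN3_MIN_DECODE_TOK_S": "12.0",
    "CK_V7_PERF_MIN_IPC": "0.8",
}
base = {"min_decode_tok_s": 8.0, "min_ipc": 0.6, "max_cache_miss_rate": 0.25}
overrides = {
    "min_decode_tok_s": first_env_float(
        "CK_V7_PERF_QWEN3_MIN_DECODE_TOK_S", "CK_V7_PERF_MIN_DECODE_TOK_S", env=fake_env
    ),
    "min_ipc": first_env_float(
        "CK_V7_PERF_QWEN3_MIN_IPC", "CK_V7_PERF_MIN_IPC", env=fake_env
    ),
    "max_cache_miss_rate": first_env_float(
        "CK_V7_PERF_QWEN3_MAX_CACHE_MISS_RATE", "CK_V7_PERF_MAX_CACHE_MISS_RATE", env=fake_env
    ),
}
budgets = apply_overrides(base, overrides)
print(budgets)
```

Here the family-specific decode budget and the global IPC budget both take effect, while the cache budget keeps its default because neither variable is set.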

Comparison helpers used by the gate

python

def compare_ge(value: Optional[float], threshold: float) -> Tuple[bool, str]:
    if value is None:
        return False, "missing"
    return value >= threshold, "ok" if value >= threshold else "below_threshold"


def compare_le(value: Optional[float], threshold: float) -> Tuple[bool, str]:
    if value is None:
        return False, "missing"
    return value <= threshold, "ok" if value <= threshold else "above_threshold"

Decode throughput extraction from profile entries

python

def compute_decode_tok_s(profile: Dict) -> Tuple[Optional[float], Dict[str, float]]:
    entries = profile.get("entries")
    if isinstance(entries, list) and entries:
        decode_entries = [e for e in entries if str(e.get("mode", "")) == "decode"]
        if decode_entries:
            total_decode_us = sum(_safe_float(e.get("time_us")) for e in decode_entries)
            token_ids = set()
            for e in decode_entries:
                token_ids.add(int(e.get("token_id", 0)))
            decode_tokens = len(token_ids)
            if total_decode_us > 0 and decode_tokens > 0:
                tok_s = decode_tokens * 1_000_000.0 / total_decode_us
                return tok_s, {"decode_total_us": total_decode_us, "decode_tokens": float(decode_tokens)}
    return None, {}

Perf-gate override examples for CI or local experiments

bash

export CK_V7_PERF_QWEN3_MIN_DECODE_TOK_S=12.0
export CK_V7_PERF_QWEN3_MIN_IPC=0.8
export CK_V7_PERF_QWEN3_MAX_CACHE_MISS_RATE=0.70
export CK_V7_PERF_QWEN3_MAX_BRANCH_MISS_RATE=0.02
python3 version/v7/scripts/perf_gate_v7.py --model-dir <model-dir>

Representative perf_gate_report.json interpretation

json

{
  "family": "qwen3",
  "checks": {
    "decode_tok_s": {"budget": 8.0, "actual": 15.7, "status": "ok"},
    "ipc": {"budget": 0.6, "actual": 1.42, "status": "ok"},
    "cache_miss_rate": {"budget": 0.25, "actual": 0.639, "status": "above_threshold"},
    "branch_miss_rate": {"budget": 0.08, "actual": 0.0052, "status": "ok"}
  },
  "overall_ok": false
}
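
To show how budgets and measurements combine into a report of this shape, here is a compact sketch of a gate evaluator. It is not the code in perf_gate_v7.py. `CHECKS` and `evaluate_gate` are illustrative names, and the only behavior assumed is the one visible above: floor checks use greater-or-equal, ceiling checks use less-or-equal, and a missing value fails.

```python
from typing import Dict, Optional

# (budget key, actual key, direction): "ge" means actual must be >= budget.
CHECKS = [
    ("min_decode_tok_s", "decode_tok_s", "ge"),
    ("min_ipc", "ipc", "ge"),
    ("max_cache_miss_rate", "cache_miss_rate", "le"),
    ("max_branch_miss_rate", "branch_miss_rate", "le"),
]


def evaluate_gate(budgets: Dict[str, float], actuals: Dict[str, Optional[float]]) -> Dict[str, object]:
    """Compare each measurement to its budget and fold the results into one verdict."""
    checks: Dict[str, Dict[str, object]] = {}
    for budget_key, actual_key, direction in CHECKS:
        budget = budgets[budget_key]
        actual = actuals.get(actual_key)
        if actual is None:
            status = "missing"
        elif direction == "ge":
            status = "ok" if actual >= budget else "below_threshold"
        else:
            status = "ok" if actual <= budget else "above_threshold"
        checks[actual_key] = {"budget": budget, "actual": actual, "status": status}
    overall_ok = all(c["status"] == "ok" for c in checks.values())
    return {"checks": checks, "overall_ok": overall_ok}


default_budgets = {
    "min_decode_tok_s": 8.0,
    "min_ipc": 0.6,
    "max_cache_miss_rate": 0.25,
    "max_branch_miss_rate": 0.08,
}
actuals = {"decode_tok_s": 15.7, "ipc": 1.42, "cache_miss_rate": 0.639, "branch_miss_rate": 0.0052}

report = evaluate_gate(default_budgets, actuals)
relaxed = evaluate_gate({**default_budgets, "max_cache_miss_rate": 0.70}, actuals)
print(report["overall_ok"], relaxed["overall_ok"])
```

With the default budgets the run fails on cache miss rate alone. Relaxing that one ceiling to 0.70, as in the override example, turns the same measurements into a pass.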

The IR Visualizer — How CKE Puts It All Together

The IR Visualizer is the glue that turns a folder full of JSON and SVG files into a coherent observability surface. In v7 it is a single offline HTML file with embedded JavaScript, totaling about 24,950 lines. Its 11 tabs span Memory, Kernel Flow, Interpretability, Weight Dtype Audit, Parity Debug, Dataflow Graph, Tests, Statistics, Backprop IR, Data & Tokenizer, and Profile.

The Profile tab is where the performance story converges. It bootstraps embedded JSON blobs, detects which tools were available on the original host, renders guidance when some artifacts are missing, and then merges the six profiling sources into one dashboard. That is why a silicon engineer can open one HTML file and immediately inspect counters, hotspots, rooflines, flamegraphs, cache stats, and op heatmaps.

The offline nature matters. Reports can be archived, emailed, attached to bug reports, or shared with a vendor without requiring a live Python service or a profiler installation on the receiving machine. One file, one run, one observability bundle.

bootstrapFromEmbeddedData() wiring in the visualizer

javascript

function bootstrapFromEmbeddedData() {
    if (embeddedBootstrapped) return true;
    const embedded = window.EMBEDDED_IR_DATA;
    if (!embedded || !embedded.files) return false;
    embeddedMeta = embedded.meta || null;
    const files = embedded.files;
    setModeData('decode', 'ir1', files.ir1_decode || null);
    setModeData('decode', 'layout', files.layout_decode || null);
    setModeData('decode', 'call', files.lowered_decode_call || null);
    setModeData('decode', 'lowered', files.lowered_decode || files.lowered_decode_call || null);
    perfStatData = files.perf_stat_summary || perfStatData;
    flamegraphManifestData = files.flamegraph_manifest || flamegraphManifestData;
    cachegrindSummaryData = files.cachegrind_summary || cachegrindSummaryData;
    vtuneSummaryData = files.vtune_summary || vtuneSummaryData;
    advisorSummaryData = files.advisor_summary || advisorSummaryData;
    perfGateData = files.perf_gate_report || perfGateData;
}

renderProfileHowTo() detects host-tool availability

javascript

function renderProfileHowTo() {
    const profileToolStatus = (embeddedMeta && typeof embeddedMeta.profile_tool_status === 'object')
        ? embeddedMeta.profile_tool_status
        : {};
    const linuxHost = String(profileToolStatus.host_platform || '').toLowerCase().startsWith('linux');
    const perfInstalled = Boolean(profileToolStatus.perf);
    const flamegraphInstalled = Boolean(profileToolStatus.flamegraph);
    const cachegrindInstalled = Boolean(profileToolStatus.valgrind) && Boolean(profileToolStatus.cg_annotate);
    const vtuneInstalled = Boolean(profileToolStatus.vtune);
    const advisorInstalled = Boolean(profileToolStatus.advisor);
    const coreHostReady = linuxHost && perfInstalled && flamegraphInstalled && cachegrindInstalled;
}
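
The `profile_tool_status` object has to be produced on the profiling host before the HTML is generated. The source does not show that step, so the following is only a sketch of how such a status map could be built with the Python standard library. The key names come from the fields read above. The function name, the PATH-based probing, and the default `flamegraph.pl` script name are assumptions.

```python
import platform
import shutil
from typing import Dict, Union


def detect_profile_tools(flamegraph_script: str = "flamegraph.pl") -> Dict[str, Union[str, bool]]:
    """Probe the host for profilers; key names follow the fields the visualizer reads."""

    def on_path(exe: str) -> bool:
        # shutil.which returns None when the executable cannot be found.
        return shutil.which(exe) is not None

    return {
        "host_platform": platform.system().lower(),
        "perf": on_path("perf"),
        "flamegraph": on_path(flamegraph_script),
        "valgrind": on_path("valgrind"),
        "cg_annotate": on_path("cg_annotate"),
        "vtune": on_path("vtune"),
        "advisor": on_path("advisor"),
    }


status = detect_profile_tools()
for key, value in status.items():
    print(f"{key:14s} {value}")
```

Passing a path such as `./FlameGraph/flamegraph.pl` instead of a bare name makes `shutil.which` check that specific file, which is closer to how the Makefile invokes the script.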

perf_artifacts_v7.py writes normalized JSON outputs

python

if args.perf_stat and args.perf_stat.exists():
    perf_summary = parse_perf_stat_text(args.perf_stat.read_text(errors="ignore"))
    perf_summary["source"] = str(args.perf_stat)
    perf_summary["cpu_topology"] = collect_cpu_topology()
    write_json(out_dir / "perf_stat_summary.json", perf_summary)
if args.flamegraph_svg and args.flamegraph_svg.exists():
    top_symbols = parse_folded_top_symbols(args.folded) if args.folded else []
    manifest = {
        "generated_at": utc_now_iso(),
        "svg_path": str(args.flamegraph_svg),
        "top_symbols": top_symbols,
    }
    write_json(out_dir / "flamegraph_manifest.json", manifest)
if args.vtune_summary and args.vtune_summary.exists():
    vtune_payload = json.loads(args.vtune_summary.read_text())
    write_json(out_dir / "vtune_summary.json", vtune_payload)

End-to-end generation path described in operational terms

bash

make profile-v7-decode
python3 version/v7/scripts/generate_profile_summary_v7.py --work-dir <model-dir>
make profile-v7-perf-stat
python3 version/v7/scripts/perf_artifacts_v7.py --model-input <model> --perf-stat build/ck_v7_perf_stat.txt
make profile-v7-flamegraph
make profile-v7-vtune
make profile-v7-advisor
make profile-v7-cachegrind
python3 version/v7/tools/open_ir_visualizer.py --generate --run <model-dir> --html-only --strict-run-artifacts

What This Means for AI on CPU

The first conclusion is blunt: single-token decode on CPU is almost always memory-bound. That is not a software embarrassment. It is the physics of batch-1 GEMV. Each weight row is consumed once, so more compute units help less than more effective bytes per second.

That is also why quantization is fundamentally a memory-bandwidth optimization. Q8 moves roughly one quarter the weight bytes of FP32. Q4 moves roughly one eighth, slightly more once per-block scales are counted. The compute instructions change too, but the first-order win is that fewer bytes need to cross the slowest link in the machine.
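
The byte savings can be put in numbers. The sketch below assumes the widely used ggml-style block layout for Q8_0 and Q4_0 (32 weights plus one fp16 scale per block). The source does not spell out CKE's exact block format, so treat the block sizes as an assumption.

```python
# Bytes per weight for a few storage formats.
# Q8_0 / Q4_0 sizes assume a ggml-style block: 32 weights + one 2-byte fp16 scale.
BYTES_PER_WEIGHT = {
    "fp32": 4.0,
    "fp16": 2.0,
    "q8_0": (32 * 1.0 + 2) / 32,  # 34 bytes per 32 weights
    "q4_0": (32 * 0.5 + 2) / 32,  # 18 bytes per 32 weights
}


def ratio_vs_fp32(fmt: str) -> float:
    """Fraction of FP32 weight traffic that a format has to move."""
    return BYTES_PER_WEIGHT[fmt] / BYTES_PER_WEIGHT["fp32"]


for fmt, bpw in BYTES_PER_WEIGHT.items():
    print(f"{fmt:5s} {bpw:.4f} B/weight  {ratio_vs_fp32(fmt):.3f}x of FP32")
```

Under that layout Q8_0 moves about 27% of the FP32 bytes and Q4_0 about 14%, so the scale overhead nudges both slightly above the ideal one quarter and one eighth.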

Prefill changes the math because it increases reuse. Once many prompt tokens are processed together, the same weight rows participate in more arithmetic, arithmetic intensity rises, and compute throughput starts to matter more. That is where AVX-512, AMX, or future SVE2/SME-style matrix capabilities matter most.
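
A back-of-envelope model shows how quickly reuse moves the operating point. It assumes an idealized weight-streaming matmul: every weight is read from memory once, contributes one multiply and one add per token, and activation traffic is ignored. The function names are illustrative, the 1.0 byte-per-weight figure stands in for an 8-bit format without scale overhead, and the ridge value of 10.0 is the constant hard-coded in the dashboard excerpt earlier in this post.

```python
import math


def weight_streaming_ai(batch_tokens: int, bytes_per_weight: float) -> float:
    """Idealized FLOPs per weight byte when each weight is read once for all tokens."""
    return 2.0 * batch_tokens / bytes_per_weight


def tokens_to_reach(ridge: float, bytes_per_weight: float) -> int:
    """Smallest token count whose idealized intensity reaches the ridge point."""
    return math.ceil(ridge * bytes_per_weight / 2.0)


BYTES_PER_WEIGHT = 1.0  # assumption: 8-bit weights, scales ignored
RIDGE = 10.0            # the ridgePoint constant from the dashboard excerpt

for tokens in (1, 4, 16, 64):
    print(f"{tokens:3d} tokens -> AI {weight_streaming_ai(tokens, BYTES_PER_WEIGHT):6.1f} FLOPs/byte")
print("tokens needed to reach the ridge:", tokens_to_reach(RIDGE, BYTES_PER_WEIGHT))
```

This is an upper bound rather than a prediction: measured intensity is lower because real kernels also move activations, scales, and partial sums. The trend is the point. Intensity grows linearly with the number of tokens sharing each weight read.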

This creates a strategic split in optimization policy. Single-user decode wants bandwidth: faster DRAM, more channels, tighter quantization, and better layouts. Multi-user batched serving wants both bandwidth and compute, because the operating point slides rightward on the roofline.

That is why server-class ARM parts are so interesting for CPU inference. A Neoverse V3-class system with 12 DDR5 channels at 8800 MT/s implies 12 × 8 bytes × 8.8 GT/s = 844.8 GB/s of raw memory bandwidth. That kind of channel-rich memory subsystem directly targets the real bottleneck of batch-1 decode: DRAM throughput. For a decode workload trapped behind DRAM, that number is not a detail. It is the system thesis.

Decode versus prefill optimization checklist

text

Decode:
  batch = 1
  GEMV-like
  weight rows streamed once
  memory-bound
  optimize bytes moved
Prefill:
  batch > 1 / prompt tokens > 1
  GEMM-like
  weight rows reused
  can become compute-bound
  optimize vector width and fused compute too

Neoverse V3 memory-bandwidth arithmetic

text

Assume:
  12 memory channels
  DDR5-8800 MT/s
  64-bit channel width = 8 bytes
Bandwidth:
  12 * 8 bytes * 8.8e9 transfers/s
  = 844.8e9 bytes/s
  = 844.8 GB/s raw theoretical bandwidth
For memory-bound decode:
  more channels can mean more tokens/s

AI-on-CPU implications chart connecting memory-bound decode, quantization as byte reduction, compute-bound batched prefill, and high-channel-count server memory systems.

Conclusion — The Performance Observatory

The deepest point of this post is that CKE does not just run inference. It diagnoses its own performance. That is a different class of software competence.

Six profiling artifacts, eleven visualization tabs, per-op instrumentation, roofline reasoning, hotspot attribution, cache analysis, and automated perf gates together form a real performance observatory. That is exactly what silicon vendors need when they ask not “does it run on our chip?” but “what is the bottleneck on our chip, and what should we optimize next?” The answer here is clear: Qwen3 decode on this AVX2 hybrid CPU is constrained by memory movement, concentrated inside the quantized dot/GEMV path.

That connects cleanly back to the recent CPU-kernel series. The SIMD deep dive explained the SIMD ladder. The ARM NEON post put ARM vector kernels in the same frame. The quantization post showed why byte reduction is the CPU story. The flash attention post showed why attention matters yet is not the main bottleneck in this run. Next up: DeltaNet and hybrid attention architectures, where the observability question becomes even more important because the runtime surface is more heterogeneous.

The observatory in one line

text

6 profiling artifacts
x 11 visualizer tabs
x automated perf gates
= one portable performance observatory for CPU inference

Follow the series

Read the previous CPU-kernel posts: SIMD Deep Dive, ARM NEON and CKE, Quantization Deep Dive, and Flash Attention on CPU.

Follow the implementation in C-Kernel-Engine on GitHub, the CKE documentation hub, and the companion videos on ANTSHiV Robotics YouTube.

