CPU performance engineering · C-Kernel-Engine
This ShivasNotes deep dive is written for CPU silicon teams and systems engineers who need to understand not just whether an AI framework is fast, but why it is or isn't fast, and exactly where the bottleneck lives. C-Kernel-Engine doesn't just run inference — it generates a full performance observatory with roofline analysis, flamegraphs, VTune hotspots, cachegrind attribution, and automated perf gates. This post teaches what each tool measures, how to read it, and what it reveals about AI workloads on CPU. Video walkthroughs are published on youtube.com/@antshivrobotics.
The difference between a toy inference runtime and a serious CPU platform is not whether text appears on screen. It is whether the runtime can explain every cycle, every miss, every stall, every hot loop, and every regression threshold in plain artifacts a silicon team can inspect. The recent ShivasNotes CPU-kernel series built the groundwork: SIMD, NEON, quantization, and flash attention. This post is where those ideas are measured against the machine itself.
What this post covers
Sections 1 through 4 establish the language of CPU performance engineering: counters, IPC, caches, TLBs, and branch prediction. Sections 5 through 8 move into the four big observability tools: roofline, VTune, Advisor, and flamegraphs.
Sections 9 through 13 zoom into CKE’s own per-op profiler, heatmaps, Theory of Constraints dashboard, perf gate, and the IR Visualizer pipeline that fuses everything into one offline HTML report. Sections 14 and 15 close with what these measurements mean for AI on CPU and why this matters to silicon vendors.
Introduction — Performance Analysis Is the Real Moat
A surprising amount of AI software stops at the sentence “the model runs.” That sentence is operationally useful, but it is not engineering mastery. The real moat is knowing exactly why the model runs at its current speed, exactly which subsystem is constraining it, and exactly which change would move the ceiling.
Most AI frameworks still treat the CPU as a black box: launch the kernel, time the end-to-end request, and hope the compiler and hardware sort it out. C-Kernel-Engine does the opposite. It treats the CPU as a first-class target with explicit profiling hooks, structured summaries, and a report pipeline designed for post-mortem analysis.
That framing is what makes CKE interesting to silicon teams. The code does not merely say “AVX2 path exists.” It also says “here is the measured IPC, here is the cache miss rate, here is the roofline position, and here is the one function consuming three quarters of total time.” Six artifacts land in one report bundle: perf stat, VTune, Advisor, flamegraph, cachegrind, and per-op profiling.
On a single Qwen3-0.6B-Q8_0 run, CKE automatically materializes six profiling artifacts: perf_stat_summary.json, vtune_summary.json, advisor_summary.json, flamegraph_manifest.json, cachegrind_summary.json, and profile_summary.json. Those files are then ingested by the IR Visualizer’s Profile tab, which sits inside a larger 11-tab offline report totaling roughly 24,950 lines of HTML and JavaScript.
Model: Qwen3-0.6B-Q8_0
Host: Intel hybrid CPU, AVX2+FMA, 12 threads auto-detected
Artifacts emitted:
perf_stat_summary.json
vtune_summary.json
advisor_summary.json
flamegraph_manifest.json
cachegrind_summary.json
profile_summary.json
Visualizer aggregation target:
ir_report.html
Profile tab consumes all six sources
Total visualizer size ≈ 24,950 lines HTML/JS
Total tabs = 11

Model : Qwen3-0.6B-Q8_0
Kernel : gemm_avx2
Threads : 12
Prefill:
12 tokens
387.6 ms
31.0 tok/s
Decode:
31 tokens
1976.2 ms
15.7 tok/s
63.7 ms/tok

make profile-v7-decode
make profile-v7-perf-stat
make profile-v7-flamegraph
make profile-v7-vtune
make profile-v7-advisor
make profile-v7-cachegrind
python3 version/v7/tools/open_ir_visualizer.py --generate --run <model-dir>
The Hardware Counters — What perf stat Measures
Hardware performance counters are small accounting registers inside the CPU’s Performance Monitoring Unit, or PMU. They count events the core really experienced: retired instructions, elapsed cycles, cache lookups, cache misses, branches, branch misses, TLB activity, and page faults. They are the machine telling you what happened, not a profiler guessing from symbols alone.
One of the most important derived metrics is IPC, instructions per cycle. The pipeline roughly moves through fetch → decode → execute → retire. If the machine retires 1.42 instructions per clock, as this run does, it means the core is doing meaningful work but still spending a visible fraction of time waiting on data or execution resources.
A rough engineering shorthand is useful here. IPC below 1.0 usually signals a stall-heavy workload. IPC above 2.0 is generally healthy utilization. IPC above 3.0 is excellent on ordinary scalar-plus-vector server/client code. Qwen3 decode lands at IPC = 1.42: not catastrophic, not brilliant, and exactly what you expect from memory-sensitive batch-1 inference.
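The shorthand above is easy to encode as a quick sanity check. The sketch below is illustrative only: the function names are hypothetical, and the "moderate" label for the 1.0 to 2.0 band is an assumption, since the text leaves that band unnamed. It uses the instruction and cycle totals reported for this run.

```python
def ipc_from_counters(instructions: float, cycles: float) -> float:
    """IPC is simply retired instructions divided by elapsed core cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def classify_ipc(ipc: float) -> str:
    """Map an IPC value onto the rough bands described in the text."""
    if ipc < 1.0:
        return "stall-heavy"
    if ipc <= 2.0:
        return "moderate"  # assumed label; the text does not name this band
    if ipc <= 3.0:
        return "healthy"
    return "excellent"


if __name__ == "__main__":
    # Totals from this run: 50.6B instructions over 35.6B cycles.
    ipc = ipc_from_counters(50.6e9, 35.6e9)
    print(f"IPC = {ipc:.2f} -> {classify_ipc(ipc)}")
```

The bands are a rule of thumb, not a hardware limit, so treat the label as a prompt for further investigation rather than a verdict.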
Hybrid Intel parts complicate the story in a good way. The PMU can expose both cpu_core and cpu_atom event domains, so CKE can track P-core and E-core behavior separately. In this run the P-cores show lower IPC at 1.27 because they are handling heavier SIMD and memory pressure, while E-cores reach 1.61 on lighter work.
| Counter | P-core value | E-core value | Derived metric |
|---|---|---|---|
| instructions | 35.6B | 15.0B | 50.6B total retired instructions |
| cycles | 28.0B | 9.3B | 35.6B total cycles observed |
| IPC | 1.27 | 1.61 | 1.42 overall IPC |
| cache-references | 0.69B | 0.32B | 1.01B total lookups |
| cache-misses | 0.46B | 0.19B | 63.9% miss rate overall |
| branches | 1.20B | 0.57B | 1.77B branches total |
| branch-misses | 6.3M | 3.0M | 0.52% miss rate |
| dTLB-loads | 6.7B | 3.38B | 10.08B total |
| dTLB-load-misses | 7.7M | 3.7M | 0.11% load miss rate |
perf stat event set wired into the v7 Makefile
perf_events="cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,branches,branch-misses,stalled-cycles-frontend,stalled-cycles-backend,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,minor-faults,major-faults"
if perf list 2>/dev/null | grep -q "dtlb_load_misses.walk_completed"; then
perf_events="$perf_events,cpu_core/dtlb_load_misses.walk_completed/,cpu_core/dtlb_store_misses.walk_completed/,cpu_core/itlb_misses.walk_completed/,cpu_core/dtlb_load_misses.stlb_hit/,cpu_core/itlb_misses.stlb_hit/"
fi
perf stat --all-user -e "$perf_events" ./build/ck-cli-v7 <runtime> --prompt "The quick brown fox" --max-tokens 32 --timing --quiet-output 2> build/ck_v7_perf_stat.txt

perf_artifacts_v7.py
def parse_perf_stat_text(text: str) -> Dict[str, object]:
    counters: Dict[str, float] = {}
    notes: Dict[str, str] = {}
    elapsed_sec: Optional[float] = None
    rows = []
    line_re = re.compile(r"^\s*([<\w\.,-]+)\s+([A-Za-z0-9_\-\.\/:]+)(?:\s+(?:#\s*)?(.*))?$")
    elapsed_re = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s+seconds\s+time\s+elapsed")
    for line in text.splitlines():
        m_elapsed = elapsed_re.search(line)
        if m_elapsed:
            try:
                elapsed_sec = float(m_elapsed.group(1))
            except ValueError:
                pass
            continue
        m = line_re.match(line)
        if not m:
            parsed_csv = parse_perf_stat_csv_line(line)
            if parsed_csv is None:
                continue
            metric, value, note = parsed_csv
        else:
            raw_value, metric, note = m.groups()

perf_artifacts_v7.py
if inst is not None and cyc and cyc > 0:
    derived["ipc"] = inst / cyc
if cache_ref and cache_ref > 0 and cache_miss is not None:
    derived["cache_miss_rate"] = cache_miss / cache_ref
if branches and branches > 0 and branch_miss is not None:
    derived["branch_miss_rate"] = branch_miss / branches
if dtlb_loads and dtlb_loads > 0 and dtlb_load_miss is not None:
    derived["dtlb_load_miss_rate"] = dtlb_load_miss / dtlb_loads
if dtlb_stores and dtlb_stores > 0 and dtlb_store_miss is not None:
    derived["dtlb_store_miss_rate"] = dtlb_store_miss / dtlb_stores
if has_page_walks:
    derived["page_walks"] = page_walks
if minor_faults is not None and inst and inst > 0:
    derived["minor_faults_per_kinst"] = (float(minor_faults) * 1000.0) / float(inst)
if major_faults is not None and inst and inst > 0:
    derived["major_faults_per_kinst"] = (float(major_faults) * 1000.0) / float(inst)

perf_stat_summary.json fields for the Qwen3 run
{
"derived": {
"ipc": 1.42,
"cache_miss_rate": 0.639,
"branch_miss_rate": 0.0052,
"dtlb_load_miss_rate": 0.0011,
"dtlb_store_miss_rate": 0.0468,
"page_walks": 17200000.0
},
"counters": {
"instructions": 50600000000.0,
"cycles": 35600000000.0,
"cache-references": 1010000000.0,
"cache-misses": 646900000.0,
"branches": 1770000000.0,
"branch-misses": 9290000.0
},
"elapsed_seconds": 0.89
}

The Cache Hierarchy — Where Bytes Live and Die
Modern CPU performance is a geography problem. The closer the bytes are to the core, the cheaper the access. As a mental model, think L1 data cache at roughly 32 KB per core and about a nanosecond away, L2 around 1.25 MB and a few nanoseconds away, L3 shared across cores at tens of megabytes and perhaps 10–15 ns away, and DRAM far away at 50–100 ns or worse.
That latency ladder is why cache statistics matter so much. This run reports 1.01B cache references and 646.9M cache misses, a 63.9% miss rate. That number looks shocking until you remember what decode is doing: batch-1 GEMV streams weight rows through the machine once per token, so the working set behaves more like a read-once river than a hot reusable tile.
Qwen3-0.6B in Q8_0 is roughly a 640 MB weight footprint before metadata overhead. The shared L3 on a client Intel part is more like 18–30 MB. Streaming 640 MB through a 24 MB last-level cache once per token is not a “maybe miss” scenario. It is a guaranteed capacity miss story. Quantization helps because it shrinks the river. It does not magically turn decode into an L1-resident compute kernel.
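The footprint arithmetic can be checked in a few lines. This is a minimal sketch that assumes the GGML-style Q8_0 layout (blocks of 32 int8 weights plus one 2-byte fp16 scale); whether CKE uses exactly that layout is an assumption here, and the function names are hypothetical. The parameter count and the 24 MB LLC size are the round numbers used in the text.

```python
Q8_0_BLOCK_WEIGHTS = 32       # weights per block (assumed GGML-style Q8_0)
Q8_0_BLOCK_BYTES = 32 + 2     # 32 int8 values + one fp16 scale per block


def q8_0_footprint_bytes(n_params: float) -> float:
    """Approximate weight bytes for n_params weights stored as Q8_0 blocks."""
    blocks = n_params / Q8_0_BLOCK_WEIGHTS
    return blocks * Q8_0_BLOCK_BYTES


def cache_coverage(footprint_bytes: float, llc_bytes: float) -> float:
    """Fraction of the weight footprint that could sit in the LLC at once."""
    return min(1.0, llc_bytes / footprint_bytes)


if __name__ == "__main__":
    footprint = q8_0_footprint_bytes(0.6e9)   # ~0.6B parameters
    coverage = cache_coverage(footprint, 24e6)  # ~24 MB shared LLC
    print(f"footprint ~ {footprint / 1e6:.1f} MB")
    print(f"LLC can hold ~ {coverage:.1%} of the weights")
```

Under these assumptions the cache can hold only a few percent of the weights at any moment, which is why the high miss rate reads as workload shape rather than a tuning failure.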
| Level | Typical capacity | Approx latency | Why it matters here |
|---|---|---|---|
| L1D | ~32 KB per core | ~1 ns | Only tiny hot vectors and loop state reliably live here. |
| L2 | ~1.25 MB per core | ~4 ns | Good for blocked kernels, too small for decode weight streaming. |
| L3 / LLC | ~18–30 MB shared | ~10–15 ns | Acts as a pressure valve, not a home, for 640 MB of weights. |
| DRAM | System memory | ~50–100 ns | The real wall for single-token decode. |
The last-level-cache counters make the story more concrete. On the P-cores, LLC loads are 16.8M and LLC load misses are 6.5M, which is a 38.7% LLC miss rate. Those misses are the accesses that escape the chip-level cache and become real DRAM traffic.
The TLB story is gentler but still relevant. There are 10.08B dTLB loads and 11.4M dTLB load misses, only 0.11%, yet that still becomes 17.2M total page walks. When large weight tensors span many 4 KB pages, huge pages can reduce translation overhead even if the main bottleneck remains bandwidth.
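To see why page size matters for translation overhead, it helps to count how many pages the weights span. The sketch below is a back-of-envelope calculation, not a measurement: it treats the 640 MB footprint as 640 MiB for round numbers and assumes a 2 MiB huge page, which is the common x86-64 size. The helper name is hypothetical.

```python
def pages_needed(footprint_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map footprint_bytes (ceiling division)."""
    if page_bytes <= 0:
        raise ValueError("page_bytes must be positive")
    return -(-footprint_bytes // page_bytes)


if __name__ == "__main__":
    footprint = 640 * 1024 * 1024                     # ~640 MiB of weights
    small = pages_needed(footprint, 4 * 1024)         # 4 KiB base pages
    huge = pages_needed(footprint, 2 * 1024 * 1024)   # 2 MiB huge pages (assumed)
    print(f"4 KiB pages: {small}")
    print(f"2 MiB pages: {huge}")
    print(f"reduction  : {small // huge}x fewer translations to cache")
```

Fewer distinct translations means each TLB entry covers far more of the weight stream, which is the mechanism behind the huge-page suggestion, even though it does not change the DRAM bandwidth limit.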
Closest first:
L1D -> tiny, extremely fast, per-core
L2 -> bigger, still local, per-core
L3 -> shared, much larger, still on-die
DRAM -> huge, slow, off-core and off-cache hierarchy
Decode implication:
weight rows stream from deeper levels
reuse is low at batch = 1
misses are not an accident
misses are the workload shape

Qwen3-0.6B, Q8_0:
~0.6B parameters
~1 byte / weight -> ~600 MB raw payload
plus scales / metadata -> ~640 MB practical footprint
Client Intel LLC:
~24 MB shared cache
Per decode token:
each row used once
each layer streams its weights again
640 MB >> 24 MB
Result:
high capacity miss rate is expected

{
"dTLB-loads": 10080000000.0,
"dTLB-load-misses": 11400000.0,
"dtlb_load_miss_rate": 0.0011,
"dTLB-stores": 160000000.0,
"dTLB-store-misses": 7490000.0,
"dtlb_store_miss_rate": 0.0468,
"page_walks": 17200000.0,
"minor-faults": 394000.0,
"major-faults": 16.0
}

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /proc/meminfo | grep -E 'Huge|AnonHuge'
getconf PAGE_SIZE
# If the platform allows explicit huge pages:
sysctl vm.nr_hugepages=4096
numactl --hardware

def derive_metrics(totals: Dict[str, int]) -> Dict[str, float]:
    ir = float(totals.get("Ir", 0))
    dr = float(totals.get("Dr", 0))
    dw = float(totals.get("Dw", 0))
    d1mr = float(totals.get("D1mr", 0))
    d1mw = float(totals.get("D1mw", 0))
    llmr = float(totals.get("LLmr", 0))
    llmw = float(totals.get("LLmw", 0))
    data_refs = dr + dw
    d1_misses = d1mr + d1mw
    ll_misses = llmr + llmw
    out: Dict[str, float] = {}
    if data_refs > 0:
        out["d1_miss_rate"] = d1_misses / data_refs
        out["ll_miss_rate"] = ll_misses / data_refs
    out["ll_miss_given_d1_miss"] = (ll_misses / d1_misses) if d1_misses > 0 else 0.0
    return out
Branch Prediction — The Pipeline's Crystal Ball
Branches are the pipeline’s gamble on the future. When the CPU sees an if, a loop back-edge, or any control-flow fork, it predicts which path will be needed next so fetch and decode keep moving. A wrong guess flushes partially built work and can cost on the order of 15 cycles or more on a modern out-of-order core.
The good news is that well-structured numerical kernels are usually predictable. Inner SIMD loops tend to be regular, count-controlled, and free of data-dependent control flow. That is exactly what the CKE counters show: 1.77B branches and only 9.29M misses, for a branch miss rate of 0.52%.
This is one of the cleanest signals in the whole report. Branch prediction is not the bottleneck here. The hot loops are straight-line numeric code, so the machine is mostly losing time on data movement rather than speculative control flow. Modern CPUs often exceed 99% branch prediction accuracy on regular loops. A 0.52% miss rate is exactly the kind of number you want to see in vectorized inference code.
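A rough upper bound on what those mispredictions cost makes the point concrete. The sketch below multiplies the measured miss count by the roughly 15-cycle flush penalty mentioned earlier and compares it to total cycles. The penalty is an order-of-magnitude assumption, not a measured value for this core, and the function name is hypothetical.

```python
def mispredict_cycle_share(branch_misses: float, cycles: float,
                           penalty_cycles: float = 15.0) -> float:
    """Estimated fraction of all cycles lost to branch-misprediction flushes."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return branch_misses * penalty_cycles / cycles


if __name__ == "__main__":
    # Counters from this run: 9.29M branch misses over 35.6B cycles.
    share = mispredict_cycle_share(9.29e6, 35.6e9)
    print(f"estimated cycles lost to mispredictions: {share:.2%}")
```

Even if the real penalty were double the assumed value, the estimate stays below one percent of cycles, which supports reading branch prediction as a non-issue for this workload.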
{
"branches": 1770000000.0,
"branch-misses": 9290000.0,
"branch_miss_rate": 0.0052,
"interpretation": "excellent branch behavior"
}

for (block = 0; block < nb; ++block) {
load bytes
widen lanes
multiply
accumulate
reduce
}
Control flow:
predictable loop counter
predictable exit test
minimal data-dependent branching
Penalty avoided:
fewer pipeline flushes
better retirement continuity

Roofline Analysis — The Most Important Chart in Performance Engineering
If there is one chart every CPU performance engineer should know, it is the roofline model from Williams, Waterman, and Patterson (2009). It compresses a huge amount of machine behavior into two axes: arithmetic intensity on the X-axis and attainable performance on the Y-axis. Both are usually plotted on log scales because the meaningful range spans orders of magnitude.
Arithmetic intensity means FLOPs per byte of DRAM traffic. If a kernel does almost no arithmetic for each byte fetched, it lives on the left side of the chart and is limited by the slanted memory-bandwidth roof. If it does a great deal of arithmetic per byte, it moves rightward until it collides with the horizontal peak-compute roof.
The intersection of those two ceilings is the ridge point. Left of the ridge, more ALUs do not help because the cores are starved for data. Right of the ridge, more memory bandwidth does not help because the machine is already compute-saturated. For an AVX2+FMA desktop-class part, a ridge point around 10–17 FLOPs/byte is a reasonable intuition. The measured float AI here is 0.031. That is not near the ridge. It is nowhere close.
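The two ceilings and the ridge point reduce to a pair of one-line formulas. The sketch below applies them to the Advisor figures reported for this run (559.3 GFLOP/s single-precision FMA peak, 32.6 GB/s DRAM bandwidth); the function names are hypothetical, and the model is the classic roofline, not anything specific to CKE.

```python
def ridge_point(peak_gflops: float, bw_gb_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the two roofs intersect."""
    return peak_gflops / bw_gb_s


def attainable_gflops(ai: float, bw_gb_s: float, peak_gflops: float) -> float:
    """Roofline bound: the lower of the memory roof and the compute roof."""
    return min(bw_gb_s * ai, peak_gflops)


if __name__ == "__main__":
    peak, bw = 559.3, 32.6
    print(f"ridge point       : {ridge_point(peak, bw):.2f} FLOPs/byte")
    print(f"bound at AI=0.031 : {attainable_gflops(0.031, bw, peak):.3f} GFLOP/s")
    print(f"bound at AI=100   : {attainable_gflops(100.0, bw, peak):.1f} GFLOP/s")
```

At the measured float intensity the bound is set entirely by the bandwidth term, so adding compute throughput would not raise it.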
| Metric | Value | Interpretation |
|---|---|---|
| Float arithmetic intensity | 0.031 FLOPs/byte | Deeply memory-bound float path. |
| Integer arithmetic intensity | 0.895 INTOPs/byte | Quantized integer work has much better byte efficiency. |
| Mixed arithmetic intensity | 0.927 ops/byte (float + integer) | Still far left of a typical AVX2 ridge point. |
| Total GFLOP/s | 0.807 | Tiny fraction of compute peak because bandwidth dominates. |
| Total GINTOP/s | 23.04 | Quantized integer throughput is doing the practical work. |
| DRAM bandwidth | 32.6 GB/s | The active ceiling for decode. |
| L1 / L2 / L3 bandwidth | 3886 / 1768 / 730 GB/s | On-die bandwidth is huge relative to DRAM, but decode does not stay resident there. |
| SP / DP FMA peak | 559.3 / 253.3 GFLOP/s | Horizontal compute roofs. |
| Vectorized loop share | 76.8% | Most runtime is already vectorized. |
| Vectorized loops / threads / ISA | 5 / 12 / AVX2, AVX | This host is using AVX2 rather than AVX-512. |
The live data tells the whole story in one sentence: float AI is 0.031 FLOPs/byte with DRAM bandwidth around 32.6 GB/s. That is a deeply memory-bound decode path. Quantized integer arithmetic helps: counting integer operations moves the workload to 0.895–0.927 ops/byte, but even that is still far to the left of where compute peak would become the limiter.
This is also why decode and prefill behave differently. Decode is GEMV-like: one token, one pass, each weight row touched once. Prefill is GEMM-like: multiple prompt tokens reuse the same weight rows, so arithmetic intensity rises and compute ceilings start to matter.
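The decode versus prefill contrast can be sketched with an idealized intensity model: each weight is read once and used for one multiply-add (2 ops) per token in the batch. This deliberately ignores activation traffic, scales, and cache effects, counts integer and float operations alike, and assumes roughly 1 byte per weight for Q8_0, so treat it as an upper-bound intuition. The function names are hypothetical.

```python
import math


def ideal_intensity(tokens: int, bytes_per_weight: float = 1.0) -> float:
    """Idealized ops per weight byte when `tokens` tokens share one weight read."""
    return 2.0 * tokens / bytes_per_weight


def tokens_to_reach_ridge(ridge_ai: float, bytes_per_weight: float = 1.0) -> int:
    """Smallest token count whose idealized intensity meets the ridge point."""
    return math.ceil(ridge_ai * bytes_per_weight / 2.0)


if __name__ == "__main__":
    ridge = 559.3 / 32.6   # SP FMA peak / DRAM bandwidth from the table above
    print(f"decode  (1 token)  : {ideal_intensity(1):.1f} ops/byte")
    print(f"prefill (12 tokens): {ideal_intensity(12):.1f} ops/byte")
    print(f"tokens needed to reach the ridge: {tokens_to_reach_ridge(ridge)}")
```

In this simplified model a single decode token can never reach the ridge, while a prompt of around ten tokens already could, which matches the qualitative GEMV versus GEMM split described above.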
X-axis:
Arithmetic Intensity = FLOPs / bytes from DRAM
Y-axis:
Attainable performance = GFLOP/s
Ceilings:
Memory roof = BW * AI
Compute roof = Peak GFLOP/s
Attainable = min(BW * AI, Peak GFLOP/s)

Decode (batch = 1):
GEMV
each weight row used once
AI stays low
usually memory-bound
Prefill (many prompt tokens):
GEMM
weight rows reused across tokens
AI rises
can become compute-bound

FP32 decode upper-bound intuition:
2 FLOPs per weight / 4 bytes per weight = 0.5 FLOPs/byte
Q8_0 decode upper-bound intuition:
2 FLOPs per weight / 1 byte per weight ≈ 2.0 FLOPs/byte
Measured float AI in this run:
0.031 FLOPs/byte
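Why measured float AI is far below the theoretical bound:
most multiply-accumulate work is integer, so it counts toward integer AI (0.895), not float AI
extra metadata loads
activations and outputs
cache line waste
thread synchronization
imperfect reuse

advisor_summary.json payload
{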
"summary_metrics": {
"float_arithmetic_intensity": 0.031,
"int_arithmetic_intensity": 0.895,
"mixed_arithmetic_intensity": 0.927,
"total_gflops": 0.807,
"total_gintops": 23.04,
"dram_bw_gb_s": 32.6,
"l1_bw_gb_s": 3886.0,
"l2_bw_gb_s": 1768.0,
"l3_bw_gb_s": 730.0,
"sp_fma_peak_gflops": 559.3,
"dp_fma_peak_gflops": 253.3,
"vectorized_loops_count": 5,
"cpu_threads": 12,
"isa_used": "AVX2, AVX"
}
}

function buildRooflineSvg(p) {
const ai = (p && Number.isFinite(Number(p.ai))) ? Number(p.ai) : null;
const gflops = (p && Number.isFinite(Number(p.gflops))) ? Number(p.gflops) : null;
const dramBw = (p && Number.isFinite(Number(p.dramBw))) ? Number(p.dramBw) : 29.0;
const peakFp32 = (p && Number.isFinite(Number(p.peakFp32))) ? Number(p.peakFp32)
: (p && Number.isFinite(Number(p.peakSp))) ? Number(p.peakSp) : 576.0;
const peakFp64 = (p && Number.isFinite(Number(p.peakFp64))) ? Number(p.peakFp64)
: (p && Number.isFinite(Number(p.peakDp))) ? Number(p.peakDp) : 288.0;
const kernels = (p && Array.isArray(p.kernels)) ? p.kernels : [];
const fp32RidgeAI = peakFp32 / dramBw;
const W = 680, H = 390;
const PL = 68, PR = 648, PT = 32, PB = 310;
const xLogMin = Math.log10(0.05), xLogMax = Math.log10(200);
const yLogMin = Math.log10(0.08), yLogMax = Math.log10(2000);

const drawRoof = (bw, color, dash, label, tooltipDesc, labelDx = 5, labelDy = -5) => {
const ridgeAI = peakFp32 / bw;
const xStart = Math.pow(10, xLogMin);
const yStart = bw * xStart;
const x1 = toX(xStart), y1 = toY(Math.min(yStart, peakFp32));
const x2 = toX(Math.min(ridgeAI, Math.pow(10, xLogMax)));
const y2 = toY(Math.min(bw * Math.min(ridgeAI, Math.pow(10, xLogMax)), peakFp32));
if (x2 > PL) {
s += `<line x1="${Math.max(x1,PL).toFixed(1)}" y1="${Math.min(y1,PB).toFixed(1)}" x2="${Math.min(x2,PR).toFixed(1)}" y2="${Math.max(y2,PT).toFixed(1)}" stroke="${color}" stroke-width="1.8" stroke-dasharray="${dash}" opacity="0.85"/>`;
const mx = (Math.max(x1,PL)+Math.min(x2,PR))/2;
const my = (Math.min(y1,PB)+Math.max(y2,PT))/2;
s += `<text x="${(mx+labelDx).toFixed(1)}" y="${(my+labelDy).toFixed(1)}" fill="${color}" font-size="8.5">${label}</text>`;
}
};
drawRoof(dramBw, '#f39c12', 'none', `DRAM ${dramBw.toFixed(0)} GB/s`, 'DRAM roof');
drawRoof(200, '#3498db', '6,3', 'L2 ~200 GB/s', 'L2 roof');
drawRoof(800, '#9b59b6', '3,4', 'L1 ~800 GB/s', 'L1 roof');

const yFp32 = toY(peakFp32);
const xFp32Ridge = toX(fp32RidgeAI);
if (yFp32 >= PT && yFp32 <= PB) {
s += `<line x1="${Math.min(xFp32Ridge,PR).toFixed(1)}" y1="${yFp32.toFixed(1)}" x2="${PR}" y2="${yFp32.toFixed(1)}" stroke="#47b475" stroke-width="1.8" opacity="0.95"/>`;
s += `<text x="${PR-4}" y="${(yFp32-5).toFixed(1)}" text-anchor="end" fill="#47b475" font-size="8">FP32·FMA ${peakFp32.toFixed(0)} GF/s</text>`;
}
const dramRidgeAI = fp32RidgeAI;
const xRidge = toX(dramRidgeAI), yRidge = toY(peakFp32);
if (xRidge >= PL && xRidge <= PR) {
s += `<line x1="${xRidge.toFixed(1)}" y1="${yRidge.toFixed(1)}" x2="${xRidge.toFixed(1)}" y2="${PB}" stroke="#f39c12" stroke-width="0.8" stroke-dasharray="2,4" opacity="0.5"/>`;
s += `<circle cx="${xRidge.toFixed(1)}" cy="${yRidge.toFixed(1)}" r="4" fill="none" stroke="#f39c12" stroke-width="1.5"/>`;
}

kernels.forEach((k, i) => {
if (!Number.isFinite(k.ai) || !Number.isFinite(k.gflops) || k.ai <= 0 || k.gflops <= 0) return;
const kx = toX(k.ai), ky = toY(k.gflops);
const labelOffset = 10 + ((i % 2) * 8);
s += `<circle cx="${kx.toFixed(1)}" cy="${ky.toFixed(1)}" r="7" fill="${k.color||'#aaa'}" opacity="0.75"/>`;
s += `<text x="${kx.toFixed(1)}" y="${(ky-labelOffset).toFixed(1)}" text-anchor="middle" fill="${k.color||'#aaa'}" font-size="7.4">${String(k.label || '')}</text>`;
});
if (ai !== null && gflops !== null && ai > 0 && gflops > 0) {
const wx = toX(ai), wy = toY(gflops);
s += `<circle cx="${wx.toFixed(1)}" cy="${wy.toFixed(1)}" r="10" fill="rgba(255,255,255,0.07)" stroke="rgba(255,255,255,0.3)"/>`;
s += `<circle cx="${wx.toFixed(1)}" cy="${wy.toFixed(1)}" r="5" fill="#ffffff" stroke="rgba(0,0,0,0.6)" stroke-width="1.5"/>`;
s += `<text x="${(wx+14).toFixed(1)}" y="${(wy-6).toFixed(1)}" fill="#ffffff" font-size="9" font-weight="700">Workload</text>`;
}

Given:
peak FP32 = 559.3 GFLOP/s
DRAM BW = 32.6 GB/s
Ridge point:
ridge = peak / bandwidth
ridge = 559.3 / 32.6
ridge ≈ 17.16 FLOPs/byte
Measured float AI:
0.031 FLOPs/byte
Measured AI as a fraction of the ridge:
0.031 / 17.16 ≈ 0.18% of the ridge intensity
Intel VTune — Microarchitecture Deep Dive
perf stat tells you the hardware symptoms. VTune tells you which code regions are causing those symptoms and, in many modes, how the machine spent its slots in Top-Down Microarchitecture Analysis Method terms: Retiring, Front-End Bound, Bad Speculation, and Back-End Bound, with the back end splitting further into memory-bound and core-bound behavior.
The hotspots list in this run is brutally concentrated. vec_dot_q8_0_q8_0_avx alone takes 14.73 seconds of CPU time, easily more than three quarters of the total hotspot budget. That immediately answers the optimization question: if you want the whole model faster, this is the one function you attack first.
This is Amdahl’s Law in its most useful operational form. If 75% of runtime sits in one kernel, a 2× improvement there lifts total speed by roughly 1.6×. If 0.3% of runtime sits in flash attention, heroic attention tuning will barely move end-to-end decode. VTune is not just a microscope. It is a priority engine.
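Amdahl's Law is short enough to keep as a helper when triaging hotspots. The sketch below uses the standard formula with `p` as the fraction of runtime in the kernel and `s` as the speedup applied to that kernel; the function name is hypothetical, and the 0.3% case uses the illustrative attention share from the paragraph above.

```python
import math


def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of runtime is accelerated by factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    # Dominant dot kernel: 75% of runtime, made 2x faster.
    print(f"dot kernel 2x      : {amdahl_speedup(0.75, 2.0):.2f}x overall")
    # Flash attention: 0.3% of runtime, made infinitely fast.
    print(f"attention, infinite: {amdahl_speedup(0.003, math.inf):.4f}x overall")
```

Even an infinitely fast attention kernel moves the total by well under one percent, while a modest win on the dot kernel is visible immediately.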
| Symbol | CPU time | Share / role |
|---|---|---|
| vec_dot_q8_0_q8_0_avx | 14.73s | >75% of hotspot time |
| worker_main | 2.45s | Thread-pool worker envelope |
| gemm_nt_q8_0_q8_0_avx2 | 0.64s | Prefill GEMM contribution |
| __memset_avx2_unaligned_erms | 0.50s | Memory zeroing overhead |
| ck_threadpool_dispatch | 0.22s | Dispatch overhead |
| gemv_q8_0_q8_0_parallel_simd | 0.22s | Parallel GEMV envelope |
| attention_flash_decode | 0.02s | <0.2%, already very fast |
| quantize_row_q8_0 | 0.02s | Activation quantization is minor |
vtune_artifacts_v7.py
def parse_hotspots_csv(path: Path, top_k: int = 25) -> List[Dict[str, object]]:
    if not path.exists():
        return []
    text = path.read_text(errors="ignore")
    if not text.strip():
        return []
    rows: List[Dict[str, object]] = []
    def _collect(reader: csv.DictReader) -> None:
        for row in reader:
            symbol = pick_text(row, ["Function", "Function/Call Stack", "Call Stack", "Source Function", "Module"])
            if not symbol:
                continue
            value = pick_value(row, ["CPU Time", "CPU Time:Self", "CPU Time:Total", "Effective Time", "Elapsed Time"])
            percent = pick_value(row, ["CPU Time:Self %", "CPU Time %", "Effective Time %", "Elapsed Time %"])
            rows.append({"symbol": symbol, "value": value if value is not None else 0.0, "percent": percent if percent is not None else 0.0})

ck_cli_v7.c
snprintf(cmd, sizeof(cmd), "vtune -collect hotspots -result-dir '%s' -quiet -- %s", vt_hot, base_train);
rc = run_shell_cmd(cmd);
if (rc != 0) return rc;
snprintf(cmd, sizeof(cmd), "vtune -report hotspots -result-dir '%s' -format text -report-output '%s' >/dev/null 2>&1", vt_hot, vt_hot_txt);
run_shell_cmd(cmd);
snprintf(cmd, sizeof(cmd), "vtune -collect memory-access -result-dir '%s' -quiet -- %s", vt_mem, base_train);
int mem_rc = run_shell_cmd(cmd);

vtune_summary.json
payload: Dict[str, object] = {
    "generated_at": utc_now_iso(),
    "analysis": "hotspots",
    "result_dir": primary.get("result_dir"),
    "report_path": primary.get("report_text"),
    "csv_path": primary.get("report_csv"),
    "top_hotspots": primary.get("top_hotspots", []),
    "hotspots": primary.get("hotspots", []),
    "raw_text": primary.get("raw_text", ""),
    "analyses": analyses,
    "analysis_metrics": {
        str(entry.get("name") or f"analysis_{i}"): entry.get("summary_metrics", {})
        for i, entry in enumerate(analyses)
    },
    "artifacts": artifacts,
}

If vec_dot_q8_0_q8_0_avx is 75% of total time:
Speedup_total = 1 / ((1 - p) + p / s)
With p = 0.75 and s = 2.0:
Speedup_total = 1 / (0.25 + 0.75 / 2)
Speedup_total = 1 / 0.625
Speedup_total = 1.6x
Meaning:
optimize the dominant dot kernel first
Intel Advisor — Roofline with Hardware Measurements
VTune and Advisor overlap, but they are not the same tool. VTune is strongest when you want hotspot ranking and microarchitectural decomposition. Advisor is strongest when you want measured memory traffic and measured arithmetic intensity that place the workload on the roofline with fewer assumptions.
That distinction matters because timing-based arithmetic intensity estimates can be directionally correct but imprecise. Advisor’s roofline collection gives you structured measurements for float AI, integer AI, mixed AI, bandwidth ceilings, vectorization coverage, and loop counts. CKE stores the HTML report, CSV export, and summary JSON so the run can be read later without rerunning the tool.
The vectorization result is especially telling. Advisor sees 5 vectorized loops covering 76.8% of runtime and reports ISA usage as “AVX2, AVX.” So the system is not slow because it forgot to vectorize. It is slow because the vectorized work is waiting on bytes. Advisor moves the conversation from “did we SIMD this?” to “did SIMD actually move us off the memory roof?”
advisor_artifacts_v7.py
key_map = {
    "ProgramTime": "elapsed_time_s",
    "ElapsedTime": "elapsed_time_s",
    "TotalGFLOPS": "total_gflops",
    "TotalGFLOPCount": "total_gflop_count",
    "TotalGINTOPS": "total_gintops",
    "TotalGINTOPCount": "total_gintop_count",
    "TotalGMixedOPS": "total_gmixed_ops",
    "TotalGMixedOPCount": "total_gmixed_op_count",
    "TotalFloatAI": "float_arithmetic_intensity",
    "TotalIntAI": "int_arithmetic_intensity",
    "TotalMixedAI": "mixed_arithmetic_intensity",
    "TotalCPUTime": "total_cpu_time_s",
    "TimeInVectorizedLoops": "time_in_vectorized_loops_s",
    "TimeInScalarLoops": "time_in_scalar_loops_s",
    "TimeOutsideOfAnyLoop": "time_outside_loops_s",
    "VectorizedLoopsCount": "vectorized_loops_count",
    "CPUThreads": "cpu_threads",
}

roof = enrichment.get("advisum_metrics", {}).get("roof_items", [])
for item in roof:
    name = item.get("name", "")
    bw = item.get("bandwidth_gops")
    if not name or bw is None:
        continue
    if name == "DRAM Bandwidth":
        flat["dram_bw_gb_s"] = round(bw, 3)
    elif name == "SP Vector FMA Peak":
        flat["sp_fma_peak_gflops"] = round(bw, 3)
    elif name == "DP Vector FMA Peak":
        flat["dp_fma_peak_gflops"] = round(bw, 3)
    elif name == "L1 Bandwidth":
        flat["l1_bw_gb_s"] = round(bw, 3)
    elif name == "L2 Bandwidth":
        flat["l2_bw_gb_s"] = round(bw, 3)
    elif name == "L3 Bandwidth":
        flat["l3_bw_gb_s"] = round(bw, 3)

ck_cli_v7.c
snprintf(cmd, sizeof(cmd), "advisor --collect=roofline --project-dir '%s' -- %s", adv_dir, base_train);
int collect_rc = run_shell_cmd(cmd);
if (collect_rc == 0) {
    snprintf(cmd, sizeof(cmd), "advisor --report=roofline --project-dir '%s' --format=text --report-output '%s' >/dev/null 2>&1", adv_dir, adv_txt);
    run_shell_cmd(cmd);
    snprintf(cmd, sizeof(cmd), "advisor --report=roofline --project-dir '%s' --format=csv --report-output '%s' >/dev/null 2>&1", adv_dir, adv_csv);
    run_shell_cmd(cmd);
    snprintf(cmd, sizeof(cmd), "advisor --report=roofline --project-dir '%s' --report-output '%s' >/dev/null 2>&1", adv_dir, adv_html);
    run_shell_cmd(cmd);
}

{
"analysis": "roofline",
"summary_metrics": {
"float_arithmetic_intensity": 0.031,
"int_arithmetic_intensity": 0.895,
"mixed_arithmetic_intensity": 0.927,
"dram_bw_gb_s": 32.6,
"l1_bw_gb_s": 3886.0,
"l2_bw_gb_s": 1768.0,
"l3_bw_gb_s": 730.0,
"sp_fma_peak_gflops": 559.3,
"dp_fma_peak_gflops": 253.3,
"vectorized_loops_count": 5,
"cpu_threads": 12,
"isa_used": "AVX2, AVX"
},
"html_path": "advisor_roofline.html"
}

Flamegraphs — Where Time Actually Goes
Flamegraphs, introduced by Brendan Gregg in 2011, are a visual compression of stack-sampling data. The usual pipeline is simple: perf record captures stacks, perf script expands them, a stack-collapser folds repeated stacks, and a renderer turns the folded counts into an SVG. The width of each box is what matters: wider means more time. Height is call-stack depth. Color is largely decorative.
Flamegraphs and VTune tell the same story from different angles. VTune says the dominant symbol is the quantized dot kernel. The flamegraph says the dominant symbol is gemv_q8_0_q8_0_parallel_simd, and it also shows the call-stack context: that work sits under worker_main and ck_threadpool_dispatch.
That stack context is the big advantage. Hotspot tables answer “what function is expensive?” Flamegraphs add “who called it, and through what runtime envelope?” In this run, flamegraphs reinforce the same constraint chain as VTune: GEMV dominates, thread-pool orchestration is visible, flash attention is tiny.
perf record --all-user -F 999 --call-graph dwarf -o "$fg_perf_data" -- ./build/ck-cli-v7 "$runtime_dir/libmodel.so" "$runtime_dir/weights.bump" --prompt "$fg_prompt" --max-tokens "$fg_max_tokens" --timing --quiet-output
perf script -i "$fg_perf_data" | ./FlameGraph/stackcollapse-perf.pl | tee "$fg_perf_folded" | ./FlameGraph/flamegraph.pl --title="$fg_title" > "$fg_svg"

perf_artifacts_v7.py
def parse_folded_top_symbols(folded_path: Path, top_k: int = 25) -> List[Dict[str, object]]:
    samples: Dict[str, int] = {}
    for line in folded_path.read_text(errors="ignore").splitlines():
        line = line.strip()
        if not line:
            continue
        stack, count_s = line.rsplit(" ", 1)
        count = int(count_s)
        if ";" in stack:
            symbol = stack.split(";")[-1]
        else:
            symbol = stack
        samples[symbol] = samples.get(symbol, 0) + count
    ranked = sorted(samples.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return [{"symbol": sym, "samples": cnt} for sym, cnt in ranked]

flamegraph_manifest.json top symbols
{
"mode": "decode",
"top_symbols": [
{"symbol": "gemv_q8_0_q8_0_parallel_simd", "samples": 4530000000},
{"symbol": "worker_main", "samples": 1450000000},
{"symbol": "gemm_nt_q8_0_q8_0", "samples": 828000000},
{"symbol": "ck_threadpool_dispatch", "samples": 249000000},
{"symbol": "__memcpy_avx_unaligned_erms", "samples": 156000000},
{"symbol": "attention_flash_decode", "samples": 28000000},
{"symbol": "quantize_row_q8_0", "samples": 7600000},
{"symbol": "swiglu_forward", "samples": 5700000}
]
}

worker_main;ck_threadpool_dispatch;gemv_q8_0_q8_0_parallel_simd 4530000000
worker_main;ck_threadpool_dispatch;gemm_nt_q8_0_q8_0 828000000
worker_main;ck_threadpool_dispatch;attention_flash_decode 28000000
Interpretation:
first stack is massively wider than the others
the leaf box owns most samples
its parents also appear wide because they sit beneath it
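The difference between a box being wide because it does the work and being wide because it sits beneath the work is the difference between self and inclusive counts. The sketch below computes both from folded-stack lines, using the three stacks shown above as input. It is an illustration with a hypothetical function name, not CKE's parser, which ranks leaf symbols only.

```python
from collections import Counter
from typing import Iterable, Tuple


def self_and_inclusive(folded_lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (self, inclusive) sample counts per frame from folded stacks."""
    self_counts: Counter = Counter()
    inclusive: Counter = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count_s = line.rpartition(" ")
        if not stack or not count_s.isdigit():
            continue  # skip malformed lines rather than raising
        count = int(count_s)
        frames = stack.split(";")
        self_counts[frames[-1]] += count      # leaf frame owns the samples
        for frame in set(frames):             # set() avoids double-counting recursion
            inclusive[frame] += count
    return self_counts, inclusive


if __name__ == "__main__":
    folded = [
        "worker_main;ck_threadpool_dispatch;gemv_q8_0_q8_0_parallel_simd 4530000000",
        "worker_main;ck_threadpool_dispatch;gemm_nt_q8_0_q8_0 828000000",
        "worker_main;ck_threadpool_dispatch;attention_flash_decode 28000000",
    ]
    self_c, incl_c = self_and_inclusive(folded)
    for frame, count in incl_c.most_common():
        print(f"{frame:32s} inclusive={count:>11d} self={self_c.get(frame, 0):>11d}")
```

For these three stacks the parent frames carry the full inclusive total but no self samples, which is exactly the visual effect of wide boxes lower in the flamegraph.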
Per-Operation Profiling — The Op-Level Breakdown
The external profilers tell you about hardware and symbols. CKE’s internal CK_PROFILE instrumentation tells you about model semantics. Every operator call can emit a CSV row with mode, token id, layer, op name, and time in microseconds.
That is how the report can say something stronger than “the matmul is hot.” It can say mlp_gate_up consumed 104.3 ms, logits consumed 79.1 ms, and attn consumed only 9.8 ms in the token-0 summary view. The surprise for many readers is that attention is only 2.6% of decode time here. The dominant cost is the MLP and final vocabulary projection.
That is exactly what the matrix shapes predict. The gate-up projection is the widest MLP expansion, and the logits projection hits a 151,936-token vocabulary. Those are giant GEMV-style reads, so they inherit the same bandwidth wall the roofline chart already warned about.
476 measurements: that is 28 layers × 17 ops, enough resolution to see exactly which operator family is burning decode time.
| Operation | Time (µs) | Percentage | Category |
|---|---|---|---|
| mlp_gate_up | 104300 | 27.4% | MLP projection |
| logits | 79100 | 20.8% | Final vocab projection |
| mlp_down | 53200 | 14.0% | MLP projection |
| out_proj | 38500 | 10.1% | Attention output projection |
| q_proj | 35200 | 9.3% | Attention projection |
| v_proj | 21700 | 5.7% | Attention projection |
| k_proj | 20100 | 5.3% | Attention projection |
| attn | 9800 | 2.6% | Flash attention core |
| residual_add | 6700 | 1.8% | Elementwise |
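The percentage column is each operator's time divided by the total profiled time. As a rough sketch of that arithmetic, the snippet below uses a handful of values from this section (the 380,890 µs total appears in the profile summary payload); `op_shares` is a hypothetical helper name, not a function from the CKE scripts.

```python
from typing import Dict, List, Tuple


def op_shares(by_op: Dict[str, float], total_us: float) -> List[Tuple[str, float, float]]:
    """Return (op, milliseconds, percent of total), most expensive first."""
    if total_us <= 0:
        return []
    ranked = sorted(by_op.items(), key=lambda kv: kv[1], reverse=True)
    return [(op, us / 1000.0, 100.0 * us / total_us) for op, us in ranked]


# Values copied from the tables in this section; the total includes ops not listed here.
by_op = {"mlp_gate_up": 104300.0, "logits": 79100.0, "mlp_down": 53200.0, "attn": 9800.0}
total_us = 380890.0
shares = op_shares(by_op, total_us)
for op, ms, pct in shares:
    print(f"{op:12s} {ms:8.1f} ms {pct:5.1f}%")
```

Dividing by the full total rather than by the sum of the listed rows is what keeps the shares comparable across reports that show different subsets of operators.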
build_summary() from generate_profile_summary_v7.py
def build_summary(entries: List[Dict[str, str]]) -> Dict[str, object]:
    by_op: Dict[str, float] = {}
    by_layer: Dict[int, Dict[str, float]] = {}
    total_us = 0.0
    for e in entries:
        if e.get("token_id", "0") != "0":
            continue
        op = e.get("op", "unknown")
        layer = int(e.get("layer", -1))
        us = float(e.get("time_us", 0))
        total_us += us
        by_op[op] = by_op.get(op, 0.0) + us
        if layer >= 0:
            if layer not in by_layer:
                by_layer[layer] = {}
            by_layer[layer][op] = by_layer[layer].get(op, 0.0) + us

profile_decode.csv
mode,token_id,layer,op,time_us
prefill,0,-1,embed,4200
prefill,0,0,q_proj,1180
prefill,0,0,k_proj,710
prefill,0,0,v_proj,760
decode,0,0,q_proj,1330
decode,0,0,k_proj,740
decode,0,0,v_proj,810
decode,0,0,attn,360
decode,0,0,mlp_gate_up,3950
decode,0,0,mlp_down,1880
decode,0,-1,logits,79100

profile_summary.json by-op payload
{
  "total_us": 380890.0,
  "total_ms": 380.89,
  "by_op": {
    "mlp_gate_up": 104300.0,
    "logits": 79100.0,
    "mlp_down": 53200.0,
    "out_proj": 38500.0,
    "q_proj": 35200.0,
    "v_proj": 21700.0,
    "k_proj": 20100.0,
    "attn": 9800.0,
    "residual_add": 6700.0
  }
}

by_mode: Dict[str, Dict[str, object]] = {}
for e in entries:
    mode = e.get("mode", "unknown")
    op = e.get("op", "unknown")
    us = float(e.get("time_us", 0))
    bucket = by_mode.setdefault(mode, {"total_us": 0.0, "by_op": {}})
    bucket["total_us"] = float(bucket["total_us"]) + us
    op_map = bucket["by_op"]
    op_map[op] = op_map.get(op, 0.0) + us

The Heatmap — Visualizing Time Across Layers
Once you have hundreds of op measurements, tables stop being intuitive. That is why the IR Visualizer renders layer-by-op heatmaps. The layout is natural for a transformer: header work at the top, 28 body layers in the middle, and footer work like final norm and logits at the end.
Color intensity is just time concentration. Dark cells are cheap. Bright cells are expensive. Layer 0 is often a bit hotter because the cache is cold, and the footer often dominates because the vocabulary projection is the single largest matrix-vector multiply in the graph.
The heatmap is not merely decorative. It quickly answers questions like “is one layer pathological?”, “is the footer dominating?”, and “does the cost pattern change between prefill and decode?” Section × op mode is especially useful because it aggregates across layers and shows which operation family dominates each region of the network.
renderProfileHeatmap()
function renderProfileHeatmap(data) {
    const entries = Array.isArray(data?.entries) ? data.entries : [];
    const heatMode = document.getElementById('profileHeatmapMode')?.value || 'layer_op';
    const rowScope = document.getElementById('profileHeatmapRows')?.value || 'layers';
    const cellValues = new Map();
    const rowTotals = new Map();
    const colTotals = new Map();
    const addCell = (rowKey, colKey, us) => {
        const key = `${rowKey}||${colKey}`;
        cellValues.set(key, (cellValues.get(key) || 0) + us);
        rowTotals.set(rowKey, (rowTotals.get(rowKey) || 0) + us);
        colTotals.set(colKey, (colTotals.get(colKey) || 0) + us);
    };

    entries.forEach((entry) => {
        const us = safeNum(entry?.time_us, 0);
        const op = String(entry?.op || 'unknown');
        const layer = parseInt(entry?.layer ?? -1, 10);
        const meta = getProfileEntryMeta(entry, metaIndex);
        const section = normalizeSectionName(meta?.section, op, layer);
        if (heatMode === 'section_op') {
            addCell(sectionLabel[section] || section, op, us);
        } else if (heatMode === 'layer_section') {
            let layerLabel = `L${layer}`;
            if (layer < 0) {
                if (section === 'header') layerLabel = 'Header';
                else if (section === 'footer') layerLabel = 'Footer';
                else layerLabel = 'Global';
            }
            addCell(layerLabel, sectionLabel[section] || section, us);
        }
    });

    let maxUs = 0;
    sortedRows.forEach((rowKey) => {
        sortedCols.forEach((colKey) => {
            const v = cellValues.get(`${rowKey}||${colKey}`) || 0;
            if (v > maxUs) maxUs = v;
        });
    });
    if (maxUs <= 0) maxUs = 1;
    html += '<div style="overflow-x:auto;"><table style="font-size:0.75rem;border-collapse:collapse;">';
    html += '<thead><tr><th style="padding:4px 8px;">Row</th>';
    sortedRows.forEach((rowKey) => {
        sortedCols.forEach((colKey) => {
            const us = cellValues.get(`${rowKey}||${colKey}`) || 0;
            const intensity = us / maxUs;
            const alpha = us <= 0 ? 0 : Math.min(1, 0.18 + 0.82 * Math.pow(intensity, 0.45));
            const bg = `rgba(255,180,0,${alpha.toFixed(3)})`;
            html += `<td style="background:${bg};text-align:center;">${us > 0 ? (us / 1000).toFixed(1) : ''}</td>`;
        });
    });
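The color mapping in that renderer is worth isolating, because the exponent decides how visible cheap cells are. Below is a small Python mirror of the alpha formula from the excerpt, written as a sketch for experimentation rather than as part of the visualizer. `heat_alpha` is a hypothetical name, and the sample times are illustrative values taken from the per-op table.

```python
def heat_alpha(us: float, max_us: float) -> float:
    """Mirror of the excerpt's alpha formula: 0 for empty cells, otherwise 0.18 to 1.0."""
    if us <= 0 or max_us <= 0:
        return 0.0
    intensity = us / max_us
    # The 0.45 exponent compresses the range so small values stay visible.
    return min(1.0, 0.18 + 0.82 * intensity ** 0.45)


max_us = 104300.0  # hottest cell, here the mlp_gate_up total
for us in (0.0, 1000.0, 9800.0, 53200.0, 104300.0):
    print(f"{us / 1000:7.1f} ms -> alpha {heat_alpha(us, max_us):.3f}")
```

With a linear mapping, a cell at 1% of the maximum would be nearly invisible. The sub-linear exponent plus the 0.18 floor keeps it readable while still reserving full brightness for the hottest cell.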
Theory of Constraints — The Bottleneck X-Ray
Eliyahu Goldratt’s Theory of Constraints says a system is only as fast as its tightest constraint. That sounds abstract until you apply it to CPU inference. Then it becomes brutally practical: if arithmetic intensity is far below the ridge point, the real bottleneck is memory bandwidth, not imagination, not hype, and not theoretical peak FLOPs.
CKE encodes that logic directly into the Profile tab. When Advisor data exists, it classifies the run as MEMORY-BOUND, BALANCED, or COMPUTE-BOUND using arithmetic intensity relative to the ridge point. When Advisor is missing, it falls back to IPC as a rough estimate.
For this Qwen3 decode run the conclusion is unambiguous. The constraint chain starts with DRAM bandwidth utilization, then compute throughput, then cache efficiency, then GEMM/GEMV time share, then branch prediction. Branch prediction ends up green. DRAM and cache behavior do not. This is why the IR report feels like an X-ray. It is not just plotting counters; it is ranking the weak links from weakest to strongest.
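The classification rule described above fits in a few lines. The sketch below restates it in Python under stated assumptions: the 10.0 FLOPs/byte ridge point and the 0.6x and 1.4x bands come from the Profile tab excerpt in this section, while the function name `classify_constraint` and the `"unknown"` result for runs with neither Advisor nor low-IPC evidence are illustrative choices, not CKE behavior.

```python
from typing import Optional


def classify_constraint(ai: Optional[float], ipc: Optional[float],
                        ridge_point: float = 10.0) -> str:
    """Classify a run by arithmetic intensity, falling back to IPC as a rough estimate."""
    if ai is not None:
        if ai < ridge_point * 0.6:
            return "memory-bound"
        if ai < ridge_point * 1.4:
            return "balanced"
        return "compute-bound"
    # No roofline data: low IPC is treated as a hint of memory stalls.
    if ipc is not None and ipc < 1.0:
        return "memory-bound"
    return "unknown"  # assumption: the excerpt does not show this branch


print(classify_constraint(ai=0.5, ipc=1.42))   # hypothetical AI value, far below the ridge
print(classify_constraint(ai=None, ipc=0.8))   # Advisor missing, IPC fallback
```

Note that the arithmetic-intensity test wins whenever Advisor data exists, which is why a run can report a healthy IPC of 1.42 and still be classified as memory-bound.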
renderProfileTOC()
const ai = physics.computeIntensity;
const ridgePoint = 10.0;
if (ai !== null && ai < ridgePoint * 0.6) {
    constraint = 'memory-bound';
    constraintTitle = 'MEMORY-BOUND';
    constraintExplain = `Arithmetic Intensity = ${ai.toFixed(3)} FLOPs/byte — well below the ridge point (~${ridgePoint} FLOPs/byte). The CPU's compute units are starved waiting for data.`;
    constraintAction = 'Reduce memory traffic: better quantization, tighter cache blocking, prefetch hints, weight layout optimization.';
} else if (ai !== null && ai >= ridgePoint * 0.6 && ai < ridgePoint * 1.4) {
    constraint = 'balanced';
} else if (ai !== null && ai >= ridgePoint * 1.4) {
    constraint = 'compute-bound';
} else if (ipcVal !== null) {
    if (ipcVal < 1.0) constraint = 'memory-bound';
}

if (physics.memoryBwGBs !== null) {
    const dramPeak = 29.0;
    const utilPct = (physics.memoryBwGBs / dramPeak) * 100;
    chain.push({
        label: 'DRAM Bandwidth',
        value: `${physics.memoryBwGBs.toFixed(1)} GB/s`,
        capacity: `${dramPeak.toFixed(0)} GB/s peak`,
        utilization: utilPct,
        insight: utilPct > 70
            ? 'Near DRAM saturation — this is the wall.'
            : 'Moderate DRAM pressure.'
    });
}
if (physics.gflops !== null) {
    const estPeak = 288.1;
    const utilPct = (physics.gflops / estPeak) * 100;
    chain.push({
        label: 'Compute Throughput',
        value: `${physics.gflops.toFixed(2)} GFLOP/s`,
        capacity: `~${estPeak.toFixed(0)} GFLOP/s peak (DP FMA)`,
        utilization: utilPct,
        insight: utilPct > 60 ? 'Good compute utilization' : 'Low compute utilization — data starvation or scalar code is wasting SIMD width.'
    });
}

if (cmrVal !== null) {
    const missRate = cmrVal * 100;
    chain.push({
        label: 'Cache Efficiency',
        value: `${missRate.toFixed(2)}% miss rate`,
        capacity: '<1% ideal for streaming workloads',
        insight: missRate < 2 ? 'Excellent cache hit rate' : 'High cache miss rate — this is a significant bottleneck.'
    });
}
if (gemmPct > 0) {
    chain.push({
        label: 'GEMM/GEMV Time Share',
        value: `${gemmPct.toFixed(1)}%`,
        capacity: '>85% ideal (matmul should dominate)',
        insight: gemmPct > 80 ? 'Matmul dominates — overhead is minimal.' : 'Low GEMM share — non-matmul overhead is the constraint.'
    });
}
if (bmr !== undefined) {
    const missRate = Number(bmr) * 100;
    chain.push({
        label: 'Branch Prediction',
        value: `${missRate.toFixed(2)}% miss rate`,
        capacity: '<0.5% ideal',
        insight: missRate < 1 ? 'Excellent branch behavior.' : 'Too many mispredicts.'
    });
}

Classification:
MEMORY-BOUND
Primary chain, weakest first:
1. DRAM Bandwidth utilization
2. Compute Throughput utilization
3. Cache Efficiency
4. GEMM/GEMV Time Share
5. Branch Prediction efficiency
Narrative:
AI is far below ridge point
cache miss rate is 63.9%
branch behavior is excellent
the wall is DRAM bandwidth
The Perf Gate — Automated Budget Enforcement
Performance engineering is not finished when you understand a bottleneck once. It is finished when regressions are stopped automatically. That is the role of the v7 perf gate: turn expected decode throughput and hardware-health floors into merge-blocking budgets.
For the model families discussed here, the defaults are straightforward. min_decode_tok_s is 8.0, min_ipc is 0.6, max_cache_miss_rate is 0.25, and max_branch_miss_rate is 0.08. If a change drops decode throughput to 6 tok/s or pushes cache misses well past the expected band, the CI gate fails.
This is what makes performance a contract instead of a dashboard screenshot. Budgets can also be overridden per family with environment variables such as CK_V7_PERF_QWEN3_MIN_DECODE_TOK_S. That lets teams tighten or relax expectations without rewriting the evaluator. CI is where performance intent becomes institutional memory.
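One way to picture the override mechanism is a lookup that tries a family-specific variable first and a global one second. The sketch below is an assumption-laden illustration, not the evaluator's code: `first_env_float` and `resolve_min_decode` are hypothetical helpers, the precedence order and the upper-casing of the family name are guesses based on the variable names in this section, and an injectable `env` mapping is used so the example does not touch the real environment.

```python
import os
from typing import Mapping, Optional


def first_env_float(*names: str, env: Optional[Mapping[str, str]] = None) -> Optional[float]:
    """Return the first variable in `names` that parses as a float, else None."""
    source = os.environ if env is None else env
    for name in names:
        raw = source.get(name)
        if raw is None or not raw.strip():
            continue
        try:
            return float(raw)
        except ValueError:
            continue  # ignore malformed values and try the next name
    return None


def resolve_min_decode(family: str, default: float = 8.0,
                       env: Optional[Mapping[str, str]] = None) -> float:
    env_family = family.upper()  # assumed mapping from "qwen3" to "QWEN3"
    override = first_env_float(
        f"CK_V7_PERF_{env_family}_MIN_DECODE_TOK_S",
        "CK_V7_PERF_MIN_DECODE_TOK_S",
        env=env,
    )
    return default if override is None else override


fake_env = {
    "CK_V7_PERF_MIN_DECODE_TOK_S": "10.0",
    "CK_V7_PERF_QWEN3_MIN_DECODE_TOK_S": "12.0",
}
print(resolve_min_decode("qwen3", env=fake_env))  # family-specific value
print(resolve_min_decode("gemma", env=fake_env))  # global fallback
print(resolve_min_decode("gemma", env={}))        # built-in default
```

Keeping the default in code and the overrides in the environment means a CI job can tighten a budget for one hardware pool without a source change.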
| Metric | Budget | Actual | Status |
|---|---|---|---|
| Decode throughput | ≥ 8.0 tok/s | 15.7 tok/s | PASS |
| IPC | ≥ 0.6 | 1.42 | PASS |
| Cache miss rate | ≤ 25% | 63.9% | FAIL if strict budget is applied |
| Branch miss rate | ≤ 8% | 0.52% | PASS |
| Family | qwen3 | qwen3 | Matched |
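The table above can be reproduced with a small evaluator that treats `min_*` budgets as floors and `max_*` budgets as ceilings. This is a sketch under that naming assumption, with a hypothetical `evaluate_gate` function. The budgets and actuals are the ones quoted in this section.

```python
from typing import Dict, Optional, Tuple

BUDGETS = {
    "min_decode_tok_s": 8.0,
    "min_ipc": 0.6,
    "max_cache_miss_rate": 0.25,
    "max_branch_miss_rate": 0.08,
}


def evaluate_gate(actuals: Dict[str, Optional[float]],
                  budgets: Dict[str, float]) -> Tuple[bool, Dict[str, str]]:
    """Check each metric against its floor (min_*) or ceiling (max_*)."""
    statuses: Dict[str, str] = {}
    for key, threshold in budgets.items():
        kind, metric = key.split("_", 1)  # "min_ipc" -> ("min", "ipc")
        value = actuals.get(metric)
        if value is None:
            statuses[metric] = "missing"  # a missing measurement fails the gate
        elif kind == "min":
            statuses[metric] = "ok" if value >= threshold else "below_threshold"
        else:
            statuses[metric] = "ok" if value <= threshold else "above_threshold"
    return all(s == "ok" for s in statuses.values()), statuses


actuals = {"decode_tok_s": 15.7, "ipc": 1.42, "cache_miss_rate": 0.639, "branch_miss_rate": 0.0052}
overall_ok, statuses = evaluate_gate(actuals, BUDGETS)
print(overall_ok, statuses)

# Relaxing only the cache budget, as a per-family override might do.
relaxed_ok, _ = evaluate_gate(actuals, dict(BUDGETS, max_cache_miss_rate=0.70))
print(relaxed_ok)
```

The run passes three of four checks, yet the gate as a whole fails, because a single violated budget is enough to block a merge.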
resolve_budgets() from perf_gate_v7.py
def resolve_budgets(family: str) -> Dict[str, float]:
    base = {
        "min_decode_tok_s": 8.0,
        "min_ipc": 0.6,
        "max_cache_miss_rate": 0.25,
        "max_branch_miss_rate": 0.08,
    }
    family_defaults = {
        "qwen2": {"min_decode_tok_s": 8.0},
        "qwen3": {"min_decode_tok_s": 8.0},
        "gemma": {"min_decode_tok_s": 8.0},
    }
    if family in family_defaults:
        base.update(family_defaults[family])

resolve_budgets()
    overrides = {
        "min_decode_tok_s": parse_env_float(
            f"CK_V7_PERF_{env_family}_MIN_DECODE_TOK_S",
            "CK_V7_PERF_MIN_DECODE_TOK_S",
        ),
        "min_ipc": parse_env_float(
            f"CK_V7_PERF_{env_family}_MIN_IPC",
            "CK_V7_PERF_MIN_IPC",
        ),
        "max_cache_miss_rate": parse_env_float(
            f"CK_V7_PERF_{env_family}_MAX_CACHE_MISS_RATE",
            "CK_V7_PERF_MAX_CACHE_MISS_RATE",
        ),
        "max_branch_miss_rate": parse_env_float(
            f"CK_V7_PERF_{env_family}_MAX_BRANCH_MISS_RATE",
            "CK_V7_PERF_MAX_BRANCH_MISS_RATE",
        ),
    }

def compare_ge(value: Optional[float], threshold: float) -> Tuple[bool, str]:
    if value is None:
        return False, "missing"
    return value >= threshold, "ok" if value >= threshold else "below_threshold"

def compare_le(value: Optional[float], threshold: float) -> Tuple[bool, str]:
    if value is None:
        return False, "missing"
    return value <= threshold, "ok" if value <= threshold else "above_threshold"

def compute_decode_tok_s(profile: Dict) -> Tuple[Optional[float], Dict[str, float]]:
    entries = profile.get("entries")
    if isinstance(entries, list) and entries:
        decode_entries = [e for e in entries if str(e.get("mode", "")) == "decode"]
        if decode_entries:
            total_decode_us = sum(_safe_float(e.get("time_us")) for e in decode_entries)
            token_ids = set()
            for e in decode_entries:
                token_ids.add(int(e.get("token_id", 0)))
            decode_tokens = len(token_ids)
            if total_decode_us > 0 and decode_tokens > 0:
                tok_s = decode_tokens * 1_000_000.0 / total_decode_us
                return tok_s, {"decode_total_us": total_decode_us, "decode_tokens": float(decode_tokens)}
    return None, {}

export CK_V7_PERF_QWEN3_MIN_DECODE_TOK_S=12.0
export CK_V7_PERF_QWEN3_MIN_IPC=0.8
export CK_V7_PERF_QWEN3_MAX_CACHE_MISS_RATE=0.70
export CK_V7_PERF_QWEN3_MAX_BRANCH_MISS_RATE=0.02
python3 version/v7/scripts/perf_gate_v7.py --model-dir <model-dir>

perf_gate_report.json interpretation
{
  "family": "qwen3",
  "checks": {
    "decode_tok_s": {"budget": 8.0, "actual": 15.7, "status": "ok"},
    "ipc": {"budget": 0.6, "actual": 1.42, "status": "ok"},
    "cache_miss_rate": {"budget": 0.25, "actual": 0.639, "status": "above_threshold"},
    "branch_miss_rate": {"budget": 0.08, "actual": 0.0052, "status": "ok"}
  },
  "overall_ok": false
}

The IR Visualizer — How CKE Puts It All Together
The IR Visualizer is the glue that turns a folder full of JSON and SVG files into a coherent observability surface. In v7 it is a single offline HTML file with embedded JavaScript, totaling about 24,950 lines. Its 11 tabs span Memory, Kernel Flow, Interpretability, Weight Dtype Audit, Parity Debug, Dataflow Graph, Tests, Statistics, Backprop IR, Data & Tokenizer, and Profile.
The Profile tab is where the performance story converges. It bootstraps embedded JSON blobs, detects which tools were available on the original host, renders guidance when some artifacts are missing, and then merges the six profiling sources into one dashboard. That is why a silicon engineer can open one HTML file and immediately inspect counters, hotspots, rooflines, flamegraphs, cache stats, and op heatmaps.
The offline nature matters. Reports can be archived, emailed, attached to bug reports, or shared with a vendor without requiring a live Python service or a profiler installation on the receiving machine. One file, one run, one observability bundle.
bootstrapFromEmbeddedData() wiring in the visualizer
function bootstrapFromEmbeddedData() {
    if (embeddedBootstrapped) return true;
    const embedded = window.EMBEDDED_IR_DATA;
    if (!embedded || !embedded.files) return false;
    embeddedMeta = embedded.meta || null;
    const files = embedded.files;
    setModeData('decode', 'ir1', files.ir1_decode || null);
    setModeData('decode', 'layout', files.layout_decode || null);
    setModeData('decode', 'call', files.lowered_decode_call || null);
    setModeData('decode', 'lowered', files.lowered_decode || files.lowered_decode_call || null);
    perfStatData = files.perf_stat_summary || perfStatData;
    flamegraphManifestData = files.flamegraph_manifest || flamegraphManifestData;
    cachegrindSummaryData = files.cachegrind_summary || cachegrindSummaryData;
    vtuneSummaryData = files.vtune_summary || vtuneSummaryData;
    advisorSummaryData = files.advisor_summary || advisorSummaryData;
    perfGateData = files.perf_gate_report || perfGateData;
}

renderProfileHowTo() detects host-tool availability
function renderProfileHowTo() {
    const profileToolStatus = (embeddedMeta && typeof embeddedMeta.profile_tool_status === 'object')
        ? embeddedMeta.profile_tool_status
        : {};
    const linuxHost = String(profileToolStatus.host_platform || '').toLowerCase().startsWith('linux');
    const perfInstalled = Boolean(profileToolStatus.perf);
    const flamegraphInstalled = Boolean(profileToolStatus.flamegraph);
    const cachegrindInstalled = Boolean(profileToolStatus.valgrind) && Boolean(profileToolStatus.cg_annotate);
    const vtuneInstalled = Boolean(profileToolStatus.vtune);
    const advisorInstalled = Boolean(profileToolStatus.advisor);
    const coreHostReady = linuxHost && perfInstalled && flamegraphInstalled && cachegrindInstalled;
}

perf_artifacts_v7.py writes normalized JSON outputs
if args.perf_stat and args.perf_stat.exists():
    perf_summary = parse_perf_stat_text(args.perf_stat.read_text(errors="ignore"))
    perf_summary["source"] = str(args.perf_stat)
    perf_summary["cpu_topology"] = collect_cpu_topology()
    write_json(out_dir / "perf_stat_summary.json", perf_summary)
if args.flamegraph_svg and args.flamegraph_svg.exists():
    top_symbols = parse_folded_top_symbols(args.folded) if args.folded else []
    manifest = {
        "generated_at": utc_now_iso(),
        "svg_path": str(args.flamegraph_svg),
        "top_symbols": top_symbols,
    }
    write_json(out_dir / "flamegraph_manifest.json", manifest)
if args.vtune_summary and args.vtune_summary.exists():
    vtune_payload = json.loads(args.vtune_summary.read_text())
    write_json(out_dir / "vtune_summary.json", vtune_payload)

make profile-v7-decode
python3 version/v7/scripts/generate_profile_summary_v7.py --work-dir <model-dir>
make profile-v7-perf-stat
python3 version/v7/scripts/perf_artifacts_v7.py --model-input <model> --perf-stat build/ck_v7_perf_stat.txt
make profile-v7-flamegraph
make profile-v7-vtune
make profile-v7-advisor
make profile-v7-cachegrind
python3 version/v7/tools/open_ir_visualizer.py --generate --run <model-dir> --html-only --strict-run-artifacts
What This Means for AI on CPU
The first conclusion is blunt: single-token decode on CPU is almost always memory-bound. That is not a software embarrassment. It is the physics of batch-1 GEMV. Each weight row is consumed once, so more compute units help less than more effective bytes per second.
That is also why quantization is fundamentally a memory-bandwidth optimization. Q8 moves roughly one quarter of the weight bytes of FP32, and Q4 moves roughly one eighth. The compute instructions change too, but the first-order win is that fewer bytes need to cross the slowest link in the machine.
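To put numbers on the byte savings, the sketch below computes average storage bits per weight for block-quantized formats. It assumes ggml-style layouts (32 weights per block plus one 2-byte fp16 scale), which the kernel names in this post suggest but which this section does not confirm for CKE. `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(block_size: int, payload_bits: int, scale_bytes: int = 2) -> float:
    """Average storage bits per weight for a block format with one scale per block."""
    total_bits = block_size * payload_bits + scale_bytes * 8
    return total_bits / block_size


FP32_BITS = 32.0
# Assumed layouts: 32 weights per block, one fp16 scale per block.
formats = {
    "Q8_0": bits_per_weight(32, 8),
    "Q4_0": bits_per_weight(32, 4),
}
for name, bits in formats.items():
    print(f"{name}: {bits:.1f} bits/weight, {bits / FP32_BITS:.3f}x the bytes of FP32")
```

The per-block scale is why the real ratios land slightly above the ideal one quarter and one eighth. For a bandwidth-bound decode, those ratios translate almost directly into the fraction of DRAM traffic that remains.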
Prefill changes the math because it increases reuse. Once many prompt tokens are processed together, the same weight rows participate in more arithmetic, arithmetic intensity rises, and compute throughput starts to matter more. That is where AVX-512, AMX, or future SVE2/SME-style matrix capabilities matter most.
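The reuse argument can be sketched with a one-line model: each weight contributes one multiply and one add per token, so FLOPs per weight byte grow linearly with the number of tokens sharing a pass over the matrix. The snippet is a simplified upper bound that ignores activation and output traffic. The 10.0 FLOPs/byte ridge point is the value hard-coded in the Profile tab excerpt earlier in this post, and the 8.5 bits per weight for Q8_0 is an assumed ggml-style figure.

```python
def weight_arithmetic_intensity(tokens: int, bytes_per_weight: float) -> float:
    """FLOPs per weight byte when `tokens` activations share one pass over a weight matrix.

    Each weight contributes 2 FLOPs (multiply + add) per token. Activation and
    output traffic is ignored, so the result is an upper bound.
    """
    return 2.0 * tokens / bytes_per_weight


RIDGE_POINT = 10.0       # FLOPs/byte, from the Profile tab excerpt
Q8_BYTES = 8.5 / 8.0     # assumed ggml-style Q8_0 storage per weight
for tokens in (1, 4, 16, 64):
    ai = weight_arithmetic_intensity(tokens, Q8_BYTES)
    side = "below" if ai < RIDGE_POINT else "at or above"
    print(f"tokens={tokens:3d}  AI={ai:6.2f} FLOPs/byte  ({side} the ridge point)")
```

Under these assumptions a single decode token sits well under the ridge point, while a prompt processed a few tokens at a time already crosses it, which is the rightward slide on the roofline that makes vector and matrix units matter.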
This creates a strategic split in optimization policy. Single-user decode wants bandwidth: faster DRAM, more channels, tighter quantization, and better layouts. Multi-user batched serving wants both bandwidth and compute, because the operating point slides rightward on the roofline.
That is why server-class ARM parts are so interesting for CPU inference. A Neoverse V3-class system with 12 DDR5 channels at 8800 MT/s implies 12 × 8 bytes × 8.8 GT/s = 844.8 GB/s of raw memory bandwidth. For a decode workload trapped behind DRAM, that number is not a detail. It is the system thesis.
844.8 GB/s. That kind of channel-rich memory subsystem directly targets the real bottleneck of batch-1 decode: DRAM throughput.
Decode:
batch = 1
GEMV-like
weight rows streamed once
memory-bound
optimize bytes moved
Prefill:
batch > 1 / prompt tokens > 1
GEMM-like
weight rows reused
can become compute-bound
optimize vector width and fused compute too

Assume:
12 memory channels
DDR5-8800 MT/s
64-bit channel width = 8 bytes
Bandwidth:
12 * 8 bytes * 8.8e9 transfers/s
= 844.8e9 bytes/s
= 844.8 GB/s raw theoretical bandwidth
For memory-bound decode:
more channels can mean more tokens/s
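The bandwidth arithmetic above, and its consequence for decode speed, can be checked with a short calculation. This is a rough sketch: both function names are hypothetical, the 1 GB of weights read per token is an illustrative figure rather than a measured model size, and real systems sustain less than raw theoretical bandwidth. The 29 GB/s comparison point is the `dramPeak` constant from the Profile tab excerpt.

```python
def peak_dram_bandwidth_gbs(channels: int, mt_per_s: float, channel_bytes: int = 8) -> float:
    """Raw theoretical DRAM bandwidth in decimal GB/s."""
    return channels * channel_bytes * mt_per_s * 1e6 / 1e9


def decode_tok_s_ceiling(bandwidth_gbs: float, weight_bytes_per_token: float) -> float:
    """Upper bound on decode tokens/s if every token streams all weights once from DRAM."""
    return bandwidth_gbs * 1e9 / weight_bytes_per_token


MODEL_BYTES = 1.0e9  # hypothetical: 1 GB of quantized weights read per token
systems = (
    ("29 GB/s (dramPeak in the Profile tab excerpt)", 29.0),
    ("12 x DDR5-8800", peak_dram_bandwidth_gbs(12, 8800)),
)
for label, bw in systems:
    print(f"{label}: <= {decode_tok_s_ceiling(bw, MODEL_BYTES):.1f} tok/s")
```

The ceiling scales linearly with bandwidth and inversely with bytes per token, which is why channel count and quantization are the two levers that matter most for batch-1 decode.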
Conclusion — The Performance Observatory
The deepest point of this post is that CKE does not just run inference. It diagnoses its own performance. That is a different class of software competence.
Six profiling artifacts, eleven visualization tabs, per-op instrumentation, roofline reasoning, hotspot attribution, cache analysis, and automated perf gates together form a real performance observatory. That is exactly what silicon vendors need when they ask not “does it run on our chip?” but “what is the bottleneck on our chip, and what should we optimize next?” The answer here is clear: Qwen3 decode on this AVX2 hybrid CPU is constrained by memory movement, concentrated inside the quantized dot/GEMV path.
That connects cleanly back to the recent CPU-kernel series. The SIMD deep dive explained the SIMD ladder. The ARM NEON post put ARM vector kernels in the same frame. The quantization post showed why byte reduction is the CPU story. The flash attention post showed why attention matters yet is not the main bottleneck in this run. Next up: DeltaNet and hybrid attention architectures, where the observability question becomes even more important because the runtime surface is more heterogeneous.
6 profiling artifacts
x 11 visualizer tabs
x automated perf gates
= one portable performance observatory for CPU inference
Follow the series
Read the previous CPU-kernel posts: SIMD Deep Dive, ARM NEON and CKE, Quantization Deep Dive, and Flash Attention on CPU.
Follow the implementation in C-Kernel-Engine on GitHub, the CKE documentation hub, and the companion videos on ANTSHiV Robotics YouTube.