This is a research-style report on the SVG training experiments I ran through C Kernel Engine v7. The goal was not to prove that a tiny model can magically become a frontier visual model. The goal was more grounded: use SVG generation as a controlled training problem so I could understand the relationship between data, tokenizer design, model capacity, evaluation gates, curriculum, and runtime artifacts.

From the outside, AI training can feel like mumbo jumbo. You hear words like pretraining, fine-tuning, evals, embeddings, context length, loss curves, and curriculum. Then you run your own model and quickly discover the uncomfortable part: the loss can go down while the model still fails the actual contract you care about.

That is the real lesson from these experiments. Training is not only optimization. Training is a system design problem. The data, representation, tokenizer, architecture, evaluation, and runtime all have to line up.

I should also be clear about the limits of this work. I have no inside view into how frontier AI labs actually do pretraining at scale. I could be doing parts of this in a clumsy or wrong way. This may be mumbo jumbo compared with a real large-scale training pipeline. But that is also the point of the lab: even if the “correct” industrial method is different, building the pipeline myself teaches me what the words mean in practice. It turns tokenizer, curriculum, evals, loss, data mixture, and probes from abstract vocabulary into things I can inspect, break, repair, and reason about.

Thesis

The SVG experiments were not about pretty pictures. They were a controlled laboratory for learning how a small transformer memorizes, generalizes, fails, repairs, and becomes measurable through reports, probes, datasets, and generated runtime artifacts.

Research Snapshot

This report synthesizes the current evidence bundle for the SVG training line: spec02 and spec03 bootstrap runs, the structured scene and infographic runs, the scene-DSL transition, the bundle-routing failures, the spec19 recovery runs, and the later gen1 experiments visible in the local IR hub. I did not find a formal spec20 artifact in the packaged manifest. Instead, after spec19 the naming changed: runs began using names such as gen1_full_scene_dsl_l3_d192_h384_ctx512_r2 and gen1_atomic_scene_dsl_l3_d192_h384_ctx2048_r1.

A quick note on terminology: in this report I use “IR hub” to mean the local experiment dashboard generated by C Kernel Engine for this training line. IR means intermediate representation: the structured representation between raw training text, the model, the compiler, and the final SVG output. The useful hub for this report is the generated ir_hub.html dashboard, not only an older ir_hub.json index file. The HTML hub is broader because it links together runs, reports, probe pages, dataset viewers, embedding views, attention views, and inference cards.

At the time I reviewed the hub, it listed 164 total runs, 151 training runs, 142 runs with reports, 119 runs with probe-report HTML, 94 parity-pass runs, 36 runs with embedding views, 24 dataset viewers, and 19 attention views. That matters because the post-spec19 story is not one clean run that simply got better. It is a branch of training experiments, reporting infrastructure, tokenizer checks, compiler contracts, failed rungs, and recovery attempts. Once I package the artifacts as a zip, the hub is the map that makes the experiment trail readable instead of just being a pile of folders.

Question | Observed Answer | Evidence
Can a local C runtime train a small transformer on SVG-like structure? | Yes, but only after the task is made measurable and the representation becomes less brittle. | Spec02/spec03 bootstrap reports, CKE v7 train reports, probe reports.
Is raw SVG the right target? | Not as the main long-term target. Raw SVG is useful for smoke tests, but too fragile for curriculum growth. | Spec04 renderability/exactness split, later scene-DSL reports.
Did the scene DSL help? | Yes. Bounded scene DSL contracts reached strong and sometimes perfect probe scores. | Spec10, spec11, spec12, spec14, spec15 reports.
Did widening the task break the model? | Yes. Spec17 and spec18 exposed routing, family-balance, and scaffold-retention failures. | Spec17 audit, spec18 plan, probe autopsies.
Did targeted curriculum repair help? | Yes. Spec19 recovered much of the bundle contract with balanced coverage and instruction variants. | Spec19 curriculum blueprint and probe reports.

The short version is simple: the project moved from “can I train something that emits SVG?” toward “can I define a compiler-backed language, train against it, and use probes to decide what to repair next?” That is a very different level of discipline.

How To Read This Report

This post is intentionally long because the important result is not one screenshot or one metric. The important result is the shape of the engineering loop. If I only show the best generated SVG, I hide the process. If I only show the loss curve, I hide the contract failures. If I only show the final spec19 result, I hide the fact that spec17 and spec18 were necessary failures.

So the report should be read in four layers:

  1. The runtime layer: C Kernel Engine v7 can train, report, probe, and package evidence locally.
  2. The representation layer: raw SVG was too brittle, so the task moved toward scene DSLs and bundle DSLs.
  3. The evaluation layer: renderable, exact, materialized exact, SVG exact, and budget truncation measure different things.
  4. The curriculum layer: each spec changed the pressure placed on the model, and the failures told me what to change next.

That is why I keep using words like contract, compiler, probe, and ledger. A training run is only useful if it leaves enough evidence to decide the next move.

Correction: What The Evidence Actually Says

My first narrative was still too clean. It described the arc, but it did not preserve enough of the ugly details. The actual artifacts show a messier and more useful lesson: this was not one training run that gradually got better. It was a sequence of representation changes, tokenizer boundary discoveries, dataset generators, compiler contracts, failed rungs, probe reports, and recovery attempts.

The biggest correction is about the tokenizer. It is too blunt to say “BPE was wrong.” The ascii_bpe tokenizer repeatedly passed byte-perfect encode/decode roundtrip. In spec03, for example, the tokenizer roundtrip passed with exact_match=true, byte_match_rate=1.0, and line_match_rate=1.0. The problem was not reversibility. The problem was that ordinary BPE did not automatically discover the control interface I cared about.
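
A minimal sketch of what a roundtrip check like this measures, assuming a tokenizer object that exposes encode and decode. The field names mirror the report fields quoted above, but how CKE v7 actually computes them is not shown in this report, so the definitions here (position-wise byte agreement and line agreement) are my assumption, and ToyByteTokenizer is a hypothetical stand-in for ascii_bpe.

```python
from dataclasses import dataclass


class ToyByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte, so it is trivially reversible."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


@dataclass
class RoundtripReport:
    exact_match: bool
    byte_match_rate: float
    line_match_rate: float


def match_rate(a, b) -> float:
    """Fraction of positions that agree, measured against the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def roundtrip_report(tokenizer, text: str) -> RoundtripReport:
    decoded = tokenizer.decode(tokenizer.encode(text))
    return RoundtripReport(
        exact_match=(decoded == text),
        byte_match_rate=match_rate(text.encode("utf-8"), decoded.encode("utf-8")),
        line_match_rate=match_rate(text.splitlines(), decoded.splitlines()),
    )


corpus = "[task:svg] [layout:single] [OUT]\n[svg] [circle] [fill:blue] [/svg]\n"
report = roundtrip_report(ToyByteTokenizer(), corpus)
print(report)
```

A perfect score here only says that the text survives encode and decode. It says nothing about how many tokens a control tag costs the model.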

The spec03 tokenizer analysis says the old bad cross-row merge class was fixed, but exact learned pieces for canonical prompt control tags were still 0 / 50. That means the tokenizer could reconstruct the text, but the model still had to learn the control language through fragmented token chains. For a tiny model trained on a formal DSL, that is a real problem. The control tags needed to be protected atoms.
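
The 0 / 50 number can be pictured with a small audit. This is a sketch under assumptions: the toy vocabulary below is hypothetical and only imitates what fragmented BPE pieces might look like, and the function name is mine, not the name used in the spec03 tooling. The control tags are taken from examples elsewhere in this report.

```python
def exact_learned_pieces(control_tags, vocab):
    """Return the tags that exist in the vocabulary as one whole piece."""
    vocab_set = set(vocab)
    return [tag for tag in control_tags if tag in vocab_set]


# Hypothetical learned vocabulary: merges found fragments, not whole tags.
toy_vocab = ["[", "]", "task", ":", "svg", "layout", "[/", "OUT", "scene", "[layout", ":single]"]

# Control tags that appear in examples in this report.
control_tags = ["[task:svg]", "[layout:single]", "[OUT]", "[scene]", "[/scene]"]

hits = exact_learned_pieces(control_tags, toy_vocab)
print(f"exact learned pieces: {len(hits)} / {len(control_tags)}")

# Reserving the tags as atoms changes the count without touching reversibility.
reserved_vocab = toy_vocab + control_tags
hits_reserved = exact_learned_pieces(control_tags, reserved_vocab)
print(f"after reserving atoms: {len(hits_reserved)} / {len(control_tags)}")
```

Both vocabularies can reconstruct the text byte for byte. Only the second one gives the model a single token per control decision.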

The second correction is about curriculum. It is also too blunt to say “incremental curriculum did not work.” Some targeted repairs helped. Spec19 r2, r3a, and r3d clearly improved over r1. But the broader lesson is that local patching against a brittle scaffold can regress, while a clean structured mixture can recover the task better. Spec19 r3d and the SFT-B instruction variant were strong because the mixture was broad and structured: anchors, routebook prompts, minimal pairs, style/topology bridges, paraphrases, hidden recombination, and balanced family coverage. Broadness alone was not enough. The spec_broader_1 run was broad and still failed badly.
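
One ingredient of that mixture, balanced family coverage, is easy to sketch. The code below is an illustration only: the function name, the pool contents, and the per-family count are hypothetical, and it covers family balance alone, not anchors, routebook prompts, minimal pairs, or any of the other slices named above.

```python
import random


def balanced_mixture(rows_by_family, per_family, seed=0):
    """Draw the same number of rows from every family, then shuffle the union.

    Families with fewer rows than `per_family` are sampled with replacement,
    so no family is under-represented in the final mixture.
    """
    rng = random.Random(seed)
    mixture = []
    for family in sorted(rows_by_family):
        rows = rows_by_family[family]
        if not rows:
            raise ValueError(f"family {family!r} has no rows")
        if len(rows) >= per_family:
            picked = rng.sample(rows, per_family)
        else:
            picked = [rng.choice(rows) for _ in range(per_family)]
        mixture.extend((family, row) for row in picked)
    rng.shuffle(mixture)
    return mixture


# Hypothetical, deliberately unbalanced pool of training rows.
pool = {
    "memory_map": [f"memory_map row {i}" for i in range(40)],
    "timeline": [f"timeline row {i}" for i in range(5)],
    "comparison_board": [f"comparison_board row {i}" for i in range(12)],
    "system_diagram": [f"system_diagram row {i}" for i in range(3)],
}

mixture = balanced_mixture(pool, per_family=8)
counts = {}
for family, _ in mixture:
    counts[family] = counts.get(family, 0) + 1
print(counts)
```

Balance is a necessary property of the mixture in this lab, not a sufficient one: the spec_broader_1 result above is the reminder that coverage without the right scaffold still fails.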

So the corrected lesson is this:

For a small local model, the training problem is not only “more data” or “lower loss.” The training problem is choosing the right symbolic interface, protecting the right atoms, generating the right curriculum mixture, and proving the contract with probes.

The Spec And Rung Ledger

The packaged evidence bundle records nineteen formal specs. Spec02 and spec03 exist as bootstrap and tokenizer/report artifacts, but the packaged manifest starts its formal spec ledger at spec04. I also found spec12 r20, but that is a rung inside spec12, not spec20. After spec19, the experiment naming changed to generation/context/rung labels. That means spec19 is not the end of the work; it is the transition point into the gen1 line.

This table is the compressed ledger. The detailed narrative follows below it.

Stage | Representation | Runs | Best Result | Lesson
Spec02 | Raw/generated SVG instruction data | Bootstrap runs | Roundtrip passed; output still brittle | Raw SVG can train, but it is a poor control language.
Spec03 | Tokenizer prompt atoms and normalized SVG assets | v1024 and v2048 bootstrap | Byte-perfect tokenizer roundtrip | Reversibility was solved; control-token atomicity was not.
Spec04 | Structured SVG atoms/scenes | 2 formal runs | 100% renderable, 0% exact | Renderable is not the same as obeying the contract.
Spec05 | Structured scenes | 1 canonical formal run | 80.8% exact, 92.3% renderable | Scaling the scene contract gave real held-out behavior.
Spec06 | Structured infographics | 7 runs | 91.7% exact, 100% renderable | Infographic structure was learnable with a stronger dataset generator.
Spec07 | First scene DSL | 2 runs | 38.9% exact, 75% renderable | The first DSL was not automatically better; grammar design mattered.
Spec08 | Rich scene DSL | 2 runs | 80.6% exact, 91.7% renderable | Richer but cleaner DSL recovered much of the lost behavior.
Spec10 | Asset scene DSL | 5 runs | 94.3% exact, 100% renderable | Asset-backed compilation made the visual task more deterministic.
Spec11 | Keyed scene DSL | 3 runs | 100% exact, 100% renderable | Keyed references separated content from layout and produced the first gold line.
Spec12 | Gold scene DSL, ctx768 | 20 rungs plus foreground variant | 100% visible and hidden exact at r17 | Compact scene DSL plus better tokenizer surface produced a stable gold contract.
Spec13a | Intent prompt bridge | 6 runs | 59.6% exact, 100% renderable | Natural-language intent bridging was harder than DSL reproduction.
Spec13b | Decision-tree IR and scene bridge | 8 runs | 55.6% exact, 100% renderable | Planning IR helped structure, but exact routing remained difficult.
Spec14a | Comparison-board family DSL | 11 runs | 97.1% exact, 100% renderable | Bounded visual families were easier to solve than universal SVG.
Spec14b | Timeline family DSL | 3 runs | 100% exact, 100% renderable | A narrow family with a clear compiler became gold quickly.
Spec15a | Memory-map family DSL | 11 runs | 100% exact, 100% renderable | Family-specific contracts can be made extremely reliable.
Spec15b | System-diagram family DSL | 2 runs | 100% exact, 100% renderable | The bounded-family strategy generalized across multiple visual forms.
Spec16 | Generalized visual bundle DSL | 12 runs | 83.3% exact, 89.6% renderable | Combining families reopened routing and scaffold-retention failures.
Spec17 | Bundle routing repair | 4 runs | 16.7% exact, 61.9% renderable | The model could render fragments, but the routing contract was broken.
Spec18 | Routing-first decomposition | 1 run | 4.8% exact, 64.3% renderable | Decomposition alone did not fix the collapsed bundle interface.
Spec19 | Textbook routing mixture | 11 runs | 88.6% exact, 95.5% renderable | Structured mixture and instruction SFT recovered much of the shared-bundle task.
Broader-1 | Broader scene DSL | 1 run | 2.4% exact, 54.8% renderable | Broad data without the right scaffold was not enough.

Artifact-Derived Run Matrix

This table is generated from local cached artifacts, not memory: probe_report.json, tokenizer_manifest.json, dataset_profile.json, and run_ledger.jsonl. It is still not a perfect benchmark table because each spec changed the task. The point is to show the shape of the evidence: exactness, renderability, materialization, tokens processed, and loss minima moved independently.

Side note: Do not compare rows as one benchmark. Spec04 and spec19 are different tasks. The comparison is useful as a training ledger, not as a leaderboard.

Run | Tokenizer line | Cases | Exact | Renderable | Materialized | Train tokens | Min loss
Spec02 raw SVG bootstrap | n/a | n/a | n/a | n/a | n/a | 27,103,744 | 0.2220
Spec04 structured scenes | n/a | 20 | 0.0% | 100.0% | 0.0% | 1,769,472 | 0.0058
Spec05 structured scenes r2 | n/a | 26 | 80.8% | 92.3% | 80.8% | 2,621,440 | 0.0139
Spec06 infographics r1 | n/a | 24 | 25.0% | 100.0% | 29.2% | 2,621,440 | 0.0066
Spec06 infographics r6 | n/a | 36 | 91.7% | 100.0% | 91.7% | 2,621,440 | 0.0067
Spec07 scene DSL r2 | n/a | 36 | 38.9% | 75.0% | 44.4% | 1,899,008 | 0.0407
Spec08 rich scene DSL r2 | n/a | 36 | 80.6% | 91.7% | 80.6% | 2,334,720 | 0.0136
Spec10 asset scene r4 | n/a | 35 | 94.3% | 100.0% | 97.1% | 1,970,176 | 0.0170
Spec11 keyed scene r2 | n/a | 35 | 100.0% | 100.0% | 100.0% | 4,506,112 | 0.0090
Spec12 scene DSL r17 | spec12_scene_dsl | 36 | 100.0% | 100.0% | 100.0% | 65,536 | 0.0309
Spec16 bundle r11 | spec16_scene_bundle | 48 | 81.2% | 95.8% | 83.3% | 1,610,496 | 0.0115
Spec17 bundle r3 | spec17_scene_bundle | 42 | 16.7% | 61.9% | 16.7% | 26,880 | 0.5231
Spec18 bundle r1 | spec18_scene_bundle | 42 | 4.8% | 64.3% | 4.8% | 29,952 | 1.0410
Spec19 r1 | spec19_scene_bundle | 42 | 4.8% | 66.7% | 4.8% | 32,256 | 0.8443
Spec19 r2 | spec19_scene_bundle | 48 | 52.1% | 81.2% | 54.2% | 419,328 | 0.0139
Spec19 r3a delta replay | spec19_probe_miss_delta | 48 | 85.4% | 93.8% | 85.4% | 62,976 | 0.0082
Spec19 r3b coherent replay | spec19_coherent_replay_union | 44 | 81.8% | 95.5% | 81.8% | 278,016 | 0.0105
Spec19 r3d balanced coverage | spec19_coherent_replay_union | 44 | 86.4% | 95.5% | 86.4% | 382,464 | 0.0079
Spec19 SFT-B instruction | spec19_coherent_replay_union | 44 | 88.6% | 95.5% | 88.6% | 393,216 | 0.0182
Spec19 r4 unified curriculum | spec19_coherent_replay_union | 44 | 61.4% | 86.4% | 63.6% | 437,760 | 0.0438
Gen1 scene r0 bootstrap | gen1_scene_dsl | 42 | 2.4% | 54.8% | 2.4% | 480,256 | 0.0807
Gen1 full r2 | gen1_full_scene_dsl | 42 | 100.0% | 100.0% | 100.0% | 1,510,400 | 0.0844
Gen1 full r3 recomb | gen1_full_scene_dsl | 32 | 43.8% | 100.0% | 62.5% | 1,324,032 | 0.0796
Gen1 full r4 editorial | gen1_full_scene_dsl | 32 | 40.6% | 100.0% | 62.5% | 1,627,136 | 0.0603
Gen1 full r5 architecture | gen1_full_scene_dsl | 32 | 43.8% | 100.0% | 62.5% | 2,217,472 | 0.0310
Gen1 atomic ctx2048 | gen1_full_scene_dsl | 6 | 0.0% | 0.0% | 0.0% | n/a | n/a
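
To make "loss minima and exactness moved independently" concrete, here is a small sketch that sorts a handful of rows from the matrix above by minimum loss. The numbers are copied from the table; the selection of rows is mine. The rows are different tasks, so this is an illustration of the ledger, not a benchmark comparison.

```python
# (run, exact %, min loss) copied from a few rows of the run matrix above.
rows = [
    ("Spec04 structured scenes", 0.0, 0.0058),
    ("Spec11 keyed scene r2", 100.0, 0.0090),
    ("Spec19 r3d balanced coverage", 86.4, 0.0079),
    ("Gen1 full r5 architecture", 43.8, 0.0310),
    ("Gen1 full r2", 100.0, 0.0844),
    ("Spec19 r1", 4.8, 0.8443),
]

# Sort by loss, lowest first, and print exactness next to it.
by_loss = sorted(rows, key=lambda r: r[2])
for run, exact, loss in by_loss:
    print(f"{loss:7.4f}  {exact:6.1f}%  {run}")

lowest_loss_run = by_loss[0]
top_exact = max(r[1] for r in rows)
best_exact_runs = [r[0] for r in rows if r[1] == top_exact]
```

In this selection the lowest-loss row is the spec04 run with 0% exact, while Gen1 full r2 reaches 100% exact at a minimum loss more than ten times higher.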

Spec02 To Gen1 Lab Notes: Learning Ladder, Tokens, DSL, And Compile Evidence

This chapter consolidates the spec-by-spec lab notes and the learning ladder into one sequence. I am writing it this way because the important learning was not the final number. The important learning was what each spec changed, what data was being trained, what the tokenizer was doing, what the DSL looked like, what the probes revealed, whether the output rendered, and why the next spec existed.

I pulled this table from the cached run artifacts, not from memory: probe reports, tokenizer roundtrip files, compiler-smoke reports, and the generated IR hub. The table is representative rather than exhaustive. Some specs had many rungs, so I list the run that best captures the lesson or the failure frontier.

Spec / Run | Probe Cases | Exact | Renderable | Tokenizer Signal | Compiler / DSL Evidence | Representative Target
Spec02 | n/a | n/a | n/a | ascii_bpe roundtrip passed; 60,000 lines; about 2.97M tokens. | Raw SVG bootstrap, not yet a clean DSL contract. | Instruction-to-SVG rows.
Spec03 | n/a | n/a | n/a | Roundtrip passed, but tag-seed audit showed important control atoms were not stable learned pieces. | Tokenizer/reporting infrastructure became the main artifact. | Reserved structural tags became necessary.
Spec04 | 20 | 0.0% | 100.0% | Roundtrip passed; 4,468 lines; about 559K tokens. | Structured SVG atom target rendered, but contract exactness was zero. | [svg] [w:128] [h:128] [circle] ... [/svg]
Spec05 | 26 | 80.8% | 92.3% | Roundtrip passed; 37,812 lines; about 3.43M tokens. | Bounded structured scenes became learnable. | [svg] [layout:single] [frame:card] [circle] ... [/svg]
Spec06 r6 | 36 | 91.7% | 100.0% | Roundtrip passed; 23,867 lines; about 6.88M tokens. | Structured infographic target worked when the contract was narrow. | [svg] [layout:bullet-panel] [topic:governance_path] ... [/svg]
Spec07 r2 | 36 | 38.9% | 75.0% | Roundtrip passed; 23,935 lines; about 1.54M tokens. | First scene DSL was directionally right but still brittle. | [scene] [canvas:wide] [layout:bullet-panel] ... [/scene]
Spec08 r2 | 36 | 80.6% | 91.7% | Roundtrip passed; 23,954 lines; about 1.85M tokens. | Richer scene DSL improved the model-facing contract. | [scene] [layout:bullet-panel] [theme:signal_glow] ... [/scene]
Spec10 r4 | 35 | 94.3% | 100.0% | Roundtrip passed; 5,115 lines; about 409K tokens. | Asset scene DSL started to separate structure from reusable visual assets. | [scene] [layout:poster_stack] [topic:memory_reality] ... [/scene]
Spec11 r2 | 35 | 100.0% | 100.0% | Roundtrip passed; 5,295 lines; about 1.12M tokens. | Keyed scene DSL proved that references beat repeated literal payloads. | [text_block:title] [ref:headline] [/text_block]-style structure.
Spec12 r17 | 36 | 100.0% | 100.0% | Roundtrip passed; 3,657 lines; about 527K tokens. | Gold scene DSL reached a bounded exact/renderable contract. | [scene] [layout:table_matrix] [ref:groups] ... [/scene]
Spec13a / 13b | 52 / 18 | 59.6% / 55.6% | 100.0% / 100.0% | Roundtrip passed in both representative runs. | Intent and planning prompts were harder than reproducing known scene contracts. | [layout:decision_tree] with node and edge references.
Spec14 / 15 | 34-48 | Strong but family-dependent | Mostly high | Roundtrip passed across representative runs. | Bounded families like comparison boards, timelines, memory maps, and system diagrams worked better than broad routing. | Family-specific scene DSLs.
Spec16 r11 | 48 | 81.2% | 95.8% | Roundtrip passed; 1,235 lines; about 225K tokens. | Shared bundle contract worked partly, but routing pressure was now real. | [bundle] [family:memory_map] [form:arena_sections] ... [/bundle]
Spec17 r3 | 42 | 16.7% | 61.9% | Roundtrip passed; 117 lines; about 13K tokens. | Routing repair failed; renderable fragments hid contract collapse. | Bundle DSL with family/form/style routing.
Spec18 r1 | 42 | 4.8% | 64.3% | Roundtrip passed; 90 lines; about 13K tokens. | Routing-first decomposition alone did not fix the shared interface. | Same bundle DSL, worse exactness.
Spec19 r3d | 44 | 86.4% | 95.5% | Roundtrip passed; 513 lines; about 85K tokens. | Compiler smoke: 9/9 compiled, 9/9 SVG exact, pass=true. | [bundle] [family:memory_map] [form:typed_regions] ... [/bundle]
Gen1 full r2 | 42 | 100.0% | 100.0% | Roundtrip passed; 15,540 lines; about 1.14M tokens. | Full scene DSL solved the visible probe before recombination pressure was added. | [scene] [layout:poster_stack] [topic:c_kernel_engine_overview] ... [/scene]
Gen1 atomic ctx2048 | n/a | n/a | n/a | Roundtrip failed; 56,268 lines; about 54.9M tokens; byte match about 1.9%. | The tokenizer/materialization boundary broke at the larger atomic-binding surface. | [scene] [canvas] was not enough; the interface itself needed repair.
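
The line and token counts in the tokenizer-signal column also allow a rough "average tokens per line" estimate, which hints at how heavy each model-facing surface was. This is back-of-the-envelope arithmetic on the rounded figures above, not a number taken from the run artifacts, and the Gen1 atomic count comes from a run whose roundtrip failed, so treat it with extra caution.

```python
# (run, input lines, approximate tokens) from the tokenizer-signal column above.
tokenizer_signal = [
    ("Spec02", 60_000, 2.97e6),
    ("Spec04", 4_468, 559e3),
    ("Spec12 r17", 3_657, 527e3),
    ("Spec19 r3d", 513, 85e3),
    ("Gen1 full r2", 15_540, 1.14e6),
    ("Gen1 atomic ctx2048", 56_268, 54.9e6),
]

# Average tokens per input line for each run.
tokens_per_line = {run: tokens / lines for run, lines, tokens in tokenizer_signal}
for run, value in tokens_per_line.items():
    print(f"{run:22s} ~{value:7.1f} tokens per line")
```

On these figures most surfaces average well under 200 tokens per line, while the Gen1 atomic surface averages close to 1,000. That is consistent with the note that the atomic-binding surface was much larger, though the average alone does not explain why the roundtrip broke.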

A concrete spec19 compiler-smoke case shows the intended lowering path. The model-facing bundle is short:

Spec19 bundle DSL text
[bundle] [family:memory_map] [form:typed_regions] [theme:infra_dark] [tone:blue] [density:balanced] [background:none] [segments:5] [brackets:0] [cards:3] [/bundle]

The compiler lowers that bundle into a more explicit scene DSL:

Lowered scene DSL text
[scene] [layout:memory_map] [theme:infra_dark] [tone:blue] [density:balanced] [topic:memory_map_generic] [header_band:header] [address_strip:offsets] [memory_segment:headers|segments.headers] [memory_segment:q4k_weights|segments.q4k_weights] [memory_segment:bf16_cache|segments.bf16_cache] [memory_segment:activations|segments.activations] [memory_segment:scratch|segments.scratch] [info_card:tensor_structure|cards.tensor_structure] [info_card:key_principles|cards.key_principles] [info_card:memory_savings|cards.memory_savings] [/scene]

The compiler-smoke artifact then emitted an actual SVG file, and the file command identified it as "SVG Scalable Vector Graphics image". That is the gate I care about: prompt to bundle, bundle to scene DSL, scene DSL to SVG, SVG renders, and the output can be compared.
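
The first step of that gate, reading the model-facing bundle, can be sketched in a few lines. This parser is my illustration, not the code in spec16_bundle_lowering_v7.py; the function name and error messages are hypothetical. It deliberately stops before lowering, because the lowering rules are owned by the compiler and are not spelled out in this report.

```python
import re

# Matches one bracketed tag without whitespace, e.g. [family:memory_map].
TAG_PATTERN = re.compile(r"\[([^\[\]\s]+)\]")


def parse_bundle(text: str) -> dict[str, str]:
    """Parse a one-line bundle into a {key: value} dict.

    Raises ValueError if the line is not wrapped in [bundle] ... [/bundle]
    or if a field is not of the form [key:value].
    """
    tags = TAG_PATTERN.findall(text)
    if len(tags) < 2 or tags[0] != "bundle" or tags[-1] != "/bundle":
        raise ValueError("expected [bundle] ... [/bundle]")
    fields = {}
    for tag in tags[1:-1]:
        key, sep, value = tag.partition(":")
        if not sep or not value:
            raise ValueError(f"malformed field: [{tag}]")
        if key in fields:
            raise ValueError(f"duplicate field: {key}")
        fields[key] = value
    return fields


bundle_text = (
    "[bundle] [family:memory_map] [form:typed_regions] [theme:infra_dark] "
    "[tone:blue] [density:balanced] [background:none] [segments:5] "
    "[brackets:0] [cards:3] [/bundle]"
)
bundle = parse_bundle(bundle_text)
print(bundle["family"], bundle["segments"], bundle["cards"])
```

The parsed counts line up with the lowered scene shown above: segments is 5 and the scene carries five memory_segment entries, cards is 3 and the scene carries three info_card entries.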

Dataset Generator And Report Script Map

The SVG training line was not one hand-written dataset. It became a generator stack. The important scripts are part of the research artifact because they define what the model was allowed to see, what the compiler expected, and what the probes measured.

Side note: The dataset is code. For this project, the scripts are as important as the model weights. They encode the curriculum.

Layer | Representative scripts | What changed
Raw SVG bootstrap | generate_svg_instruction_dataset_v7.py, build_svg_pretrain_corpus_v7.py, build_svg_corpus_from_assets_v7.py | Built the first instruction/SVG and pretrain corpora from normalized SVG assets.
Normalization and asset audit | normalize_svg_assets_v7.py, audit_svg_assets_patterns_v7.py, classify_svg_assets_v7.py | Made the input corpus inspectable and showed why raw SVG had too many uncontrolled degrees of freedom.
Structured scenes | generate_svg_structured_spec05_v7.py, generate_svg_structured_spec06_v7.py, spec06_infographic_content_v7.py | Moved from raw SVG toward bounded visual contracts and generated infographic content.
Scene DSL and renderers | generate_svg_structured_spec07_v7.py, generate_svg_structured_spec08_v7.py, render_svg_structured_scene_v7.py, render_svg_structured_scene_rich_v7.py | Separated model-facing scene structure from deterministic SVG rendering.
Asset-backed DSL | build_spec09_asset_library_v7.py, build_spec09_asset_alignment_report_v7.py, generate_svg_structured_spec10_v7.py | Introduced reusable asset references so the model did not have to emit every visual detail literally.
Gold/compact contracts | generate_svg_structured_spec12_v7.py, build_spec12_gold_budget_report_v7.py, build_spec12_probe_contract_v7.py | Compressed repeated payloads and created probe contracts for exact/materialized scoring.
Bundle routing | generate_svg_structured_spec16_v7.py, spec16_scene_bundle_v7.py, spec16_bundle_lowering_v7.py | Changed the task from full scene reproduction to compact bundle routing and deterministic lowering.
Failure recovery | generate_svg_structured_spec19_v7.py, build_spec19_probe_contract_v7.py, build_spec19_compiler_smoke_report_v7.py, build_bundle_probe_autopsy_v7.py | Added balanced coverage, routebook prompts, minimal pairs, compiler smoke tests, and miss autopsies.
Training orchestration | spec10_pretrain_midtrain_v7.sh, spec12_pretrain_midtrain_v7.sh, spec19_balanced_coverage_pretrain_midtrain_v7.sh, spec19_sft_b_on_r3d_v7.sh | Made runs reproducible enough to leave ledgers: tokens, steps, loss, datasets, checkpoints, and probe reports.

Tokenizer And Binding Examples From The Actual Line

The tokenizer lesson is easier to see with concrete surfaces. The early raw-SVG target forced the model to learn XML syntax, geometry, text, colors, and closing tags at the same time. The later scene and bundle targets compressed the problem into explicit structural tokens. The final atomic-binding direction went further: visible text became a bound content source rather than literal prose the model had to memorize.

Side note: Tokenizer design is architecture. For this lab, tokenization was not a preprocessing detail. It changed what the model could learn cleanly.

Stage: Raw SVG
Model-facing surface: <svg> ... <text>Every tensor has a fixed offset</text> ... </svg>
Why it mattered: Too much syntax and prose. Renderability can improve while semantic contract stays wrong.

Stage: Structured scene
Model-facing surface: [scene] [canvas:wide] [layout:dashboard_cards] [section_card:Observe|loss_curve|trace_anomalies] [/scene]
Why it mattered: The model emits a controlled scene language. The compiler owns exact SVG geometry.

Stage: Bundle routing
Model-facing surface: [bundle] [family:memory_map] [form:layer_stack] [theme:signal_glow] [segments:6] [/bundle]
Why it mattered: The model chooses family/form/topology. The compiler lowers it to a full scene/SVG.

Stage: Atomic bindings
Model-facing surface: [component] [name] table_block [/name] [bind] groups 0 [/bind] [/component]
Why it mattered: The model chooses structure and binding paths. The content layer supplies final text and values.

Literal prose target versus binding target
Earlier style:
[section_card:Observe_End_to_End|loss_curve_to_kernel_dispatch|trace_anomalies_across_the_full_stack|variant=hero|accent=amber]

Later binding style:
[component] [name] section_card [/name]
[bind] slots cards 0 title [/bind]
[bind] slots cards 0 value [/bind]
[bind] slots cards 0 note [/bind]
[meta] variant hero [/meta] [meta] accent amber [/meta]
[/component]
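
A sketch of how a content layer could consume the binding style above: parse the component block, then resolve each bind path against a content document. Everything here is an assumption for illustration. The parser, the resolve rule (numeric steps index lists), and the shape of the content dictionary are hypothetical; only the component block itself is copied from the example above, and the strings in the content dictionary are taken from the earlier-style example with underscores turned into spaces.

```python
import re


def parse_component(block: str) -> dict:
    """Extract the component name, bind paths, and meta pairs from one block."""
    name = re.search(r"\[name\]\s*(.*?)\s*\[/name\]", block)
    if name is None:
        raise ValueError("component has no [name]")
    binds = [m.split() for m in re.findall(r"\[bind\]\s*(.*?)\s*\[/bind\]", block)]
    meta = dict(re.findall(r"\[meta\]\s*(\S+)\s+(.*?)\s*\[/meta\]", block))
    return {"name": name.group(1), "binds": binds, "meta": meta}


def resolve(content, path):
    """Walk a bind path through nested dicts and lists; numeric steps index lists."""
    node = content
    for step in path:
        node = node[int(step)] if isinstance(node, list) else node[step]
    return node


component_text = """[component] [name] section_card [/name]
[bind] slots cards 0 title [/bind]
[bind] slots cards 0 value [/bind]
[bind] slots cards 0 note [/bind]
[meta] variant hero [/meta] [meta] accent amber [/meta]
[/component]"""

# Hypothetical content layer: the text lives here, not in the model output.
content = {
    "slots": {
        "cards": [
            {
                "title": "Observe End to End",
                "value": "loss curve to kernel dispatch",
                "note": "trace anomalies across the full stack",
            }
        ]
    }
}

component = parse_component(component_text)
resolved = [resolve(content, path) for path in component["binds"]]
print(component["name"], component["meta"])
print(resolved)
```

The design point is the split of responsibility: the model output carries only structure and paths, so changing the visible prose means editing the content document, not retraining the model.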

Spec02: Raw SVG And Instruction Bootstrap

Spec02 was the “can this pipeline even train on SVG?” stage. The data lived in generated files such as spec02_sft_v2_instruction_train.txt, spec02_sft_v2_svg_train.txt, spec02_sft7_v1_instruction_train.txt, and spec02_sft7_mix_instruction_train.txt. The model-facing surface was still close to instruction-to-SVG rather than a clean compiler DSL.

The important result from spec02 was not that the model became good. The important result was that the training runtime could ingest generated SVG rows, tokenize them, train locally, and leave artifacts in the cache. The svg_l16_d128_h512_v1024_ctx512_spec02 run had an ascii_bpe tokenizer roundtrip pass over 60,000 input lines and about 2.97M tokens. That proved the pipeline could move data through the runtime. It did not prove the representation was good.

The mistake was visible early: raw SVG makes the model learn many things at once. It must learn syntax, geometry, XML closure, visual composition, numeric formatting, and prompt obedience. That is too much pressure for a small model if the goal is to understand the training mechanics.

Spec03: Tokenizer And Prompt-Atom Audit

Spec03 shifted the focus from “can it train?” to “what language is the model actually seeing?” The artifacts under data/spec03 include normalized assets, tokenizer corpus files, tag seed rows, reserved control tokens, fit audits, and pretrain materialization manifests.

The tokenizer was technically clean. The bootstrap runs passed exact roundtrip. The bad cross-row merge class like </svg>\n<svg was gone. But the prompt atom audit found the deeper issue: canonical control tags from spec03_tag_seed_rows.txt were not becoming stable learned pieces. The report records 0 / 50 exact learned pieces for those tags.

That is why BPE was not “bad” in the generic sense. BPE did its byte-reconstruction job. It did not automatically produce the control interface needed for a tiny formal-domain model. The conclusion was representation-first: reserve important control atoms, and do not ask frequency merges to discover the DSL contract by accident.
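
One common way to reserve control atoms is to split them out before the subword tokenizer ever sees the text, so each tag maps to exactly one id. The sketch below shows that idea under assumptions: the tag pattern, the id numbers, and the byte-level fallback are all hypothetical, and this is not a description of how the CKE v7 tokenizer implements its reserved control tokens.

```python
import re

# Any bracketed tag without whitespace, e.g. [task:svg], [OUT], [/scene].
CONTROL_TAG = re.compile(r"(\[[^\[\]\s]+\])")


def split_protected(text: str) -> list[tuple[str, str]]:
    """Split text into ("atom", tag) and ("free", span) segments, in order."""
    segments = []
    for part in CONTROL_TAG.split(text):
        if not part:
            continue
        kind = "atom" if CONTROL_TAG.fullmatch(part) else "free"
        segments.append((kind, part))
    return segments


def encode_with_atoms(text, atom_ids, encode_free):
    """Map each reserved tag to one id; send everything else to `encode_free`."""
    ids = []
    for kind, part in split_protected(text):
        if kind == "atom" and part in atom_ids:
            ids.append(atom_ids[part])
        else:
            ids.extend(encode_free(part))
    return ids


def byte_fallback(span: str) -> list[int]:
    """Stand-in for the subword tokenizer: one id per UTF-8 byte."""
    return list(span.encode("utf-8"))


# Hypothetical id assignment: reserved atoms live above the byte range.
atom_ids = {"[task:svg]": 256, "[layout:single]": 257, "[OUT]": 258}

ids = encode_with_atoms("[task:svg] [layout:single] [OUT]", atom_ids, byte_fallback)
print(ids)  # [256, 32, 257, 32, 258]
```

With this split, a three-tag prompt costs three atom ids plus two spaces instead of a chain of fragments, and frequency merges never get the chance to cut a tag in half.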

Spec04: Structured Atoms And The Renderability Trap

Spec04 used structured SVG atoms and structured scenes. The formal manifest records two runs and eight stage records. The best probe result was 100% renderable and 0% exact. That is one of the most useful failures in the whole experiment.

The model could produce output that rendered, but it did not match the target contract. In visual tasks this is dangerous because a human can look at a rendered artifact and feel progress. The probe report tells the truth: renderability is syntax/renderer success, not semantic or contract success.

The spec04 iteration report also showed that composition-only midtrain helped narrow train behavior but did not generalize. The blended midtrain variant was better than composition-only, but still did not solve holdout exactness. That was an early warning that “add a narrow repair slice” is not always enough.

Figure: Spec04 generated red triangle output. Spec04 train_01 evidence: the prompt asked for a blue circle, but the model emitted a renderable badge/triangle structure. Valid SVG, wrong contract.

Spec04 actual model response text
Prompt: [task:svg] [layout:single] [shape:circle] [color:blue] [size:big] [bg:none] [OUT]
Expected: [svg] ... [circle] [cx:64] [cy:64] [r:30] [fill:blue] ...
Generated: [svg] ... [bg:mint] [layout:badge] [rect] ... [polygon] ... [fill:red] ... [/svg]
Metrics: exact=false, materialized_exact=false, svg_exact=false, valid_svg=true, renderable=true
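
The split between those metrics can be sketched as a scorer that reports each axis separately. This is an illustration under assumptions: the DSL and SVG strings below are shortened, hypothetical stand-ins rather than the actual train_01 rows, XML well-formedness is only a proxy for what the probe calls valid or renderable, and materialized_exact is left out because it depends on the compiler.

```python
import xml.etree.ElementTree as ET


def is_well_formed_svg(svg_text: str) -> bool:
    """True if the text parses as XML and the root element is <svg>."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return False
    # The tag may carry a namespace, e.g. "{http://www.w3.org/2000/svg}svg".
    return root.tag.rsplit("}", 1)[-1] == "svg"


def score_case(expected_dsl, generated_dsl, expected_svg, generated_svg):
    """Score one probe case on separate axes instead of one number."""
    return {
        "exact": generated_dsl.split() == expected_dsl.split(),
        "svg_exact": generated_svg == expected_svg,
        "valid_svg": is_well_formed_svg(generated_svg),
    }


# Hypothetical stand-ins for an expected blue circle and a generated red triangle.
expected_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="128" height="128">'
    '<circle cx="64" cy="64" r="30" fill="blue"/></svg>'
)
generated_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="128" height="128">'
    '<polygon points="64,20 100,100 28,100" fill="red"/></svg>'
)

result = score_case(
    "[svg] [circle] [cx:64] [cy:64] [r:30] [fill:blue] [/svg]",
    "[svg] [polygon] [fill:red] [/svg]",
    expected_svg,
    generated_svg,
)
print(result)  # {'exact': False, 'svg_exact': False, 'valid_svg': True}
```

Keeping the axes separate is what exposes the trap: a single "did it produce an SVG" check would have scored this case as a success.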

Spec05 And Spec06: Bounded Structured Scenes Become Learnable

Spec05 moved into a stronger structured-scene run. The best canonical spec05 run reached 80.8% exact and 92.3% renderable. That was the first strong sign that the small model could learn a bounded visual contract when the target was cleaner.

Spec06 extended the idea into structured infographics. It ran seven rungs. The exact score moved unevenly across rungs: 25.0%, 75.0%, 29.2%, 38.9%, 69.4%, 91.7%, then 61.1%. This is important because it shows the empirical nature of the work. More rungs did not monotonically improve. The best spec06 rung was r6, not the last one.
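
The practical consequence is that the checkpoint to keep is chosen by probe score, not by recency. A tiny sketch using the seven spec06 exact scores listed above, assuming they are in rung order r1 through r7:

```python
# Exact scores for spec06 rungs r1..r7, as listed above.
exact_by_rung = [25.0, 75.0, 29.2, 38.9, 69.4, 91.7, 61.1]

# Index of the highest exact score; rungs are numbered from 1.
best_index = max(range(len(exact_by_rung)), key=exact_by_rung.__getitem__)
best_rung = f"r{best_index + 1}"
last_rung = f"r{len(exact_by_rung)}"

# Count rung-to-rung regressions to show the curve is not monotonic.
regressions = sum(b < a for a, b in zip(exact_by_rung, exact_by_rung[1:]))

print(best_rung, exact_by_rung[best_index])  # r6 91.7
print("last rung:", last_rung, exact_by_rung[-1])
print("regressions:", regressions)  # 2
```

Two of the six rung-to-rung steps were regressions, including the final one, so "take the latest rung" would have discarded the best result.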

The data was no longer arbitrary raw SVG. It was generated from scripts like generate_svg_structured_spec06_v7.py with structured infographic content support from spec06_infographic_content_v7.py. The DSL was still not the final scene language, but the direction was clear: the model should learn a controlled representation, and deterministic code should own as much rendering detail as possible.

Figure: Spec05 structured scene output. Spec05 evidence: bounded scene structure became easier to probe than raw SVG.

Figure: Spec06 structured infographic output. Spec06 evidence: structured infographic targets made the visual task more measurable.

Spec07 And Spec08: First Scene DSL, Then Richer Scene DSL

Spec07 introduced the first scene DSL grammar. It regressed: r1 reached only 5.6% exact and 69.4% renderable, while r2 improved to 38.9% exact and 75.0% renderable. This is why “use a DSL” is not a magic answer. A bad or incomplete DSL can still be hard for the model.

Spec08 made the scene DSL richer and cleaner. R1 reached 75.0% exact and 100% renderable. R2 reached 80.6% exact and 91.7% renderable. The lesson was not that richer always helps. The lesson was that the DSL needs the right boundary: enough structure to guide the model, not so much repeated payload that the tokenizer turns the language into brittle chunks.

Spec10: Asset Scene DSL

Spec10 brought asset-library thinking into the scene DSL. The script chain included build_spec09_asset_library_v7.py, build_spec09_asset_alignment_report_v7.py, and generate_svg_structured_spec10_v7.py. This shifted the task toward references and reusable visual assets instead of literal full SVG.

The run behavior was unstable but informative. R1 reached 33.3% exact and 97.2% renderable. R2 collapsed to 5.7% exact and 14.3% renderable. R3 recovered to 62.9% exact and 80.0% renderable. R4 reached 94.3% exact and 100% renderable. R5 collapsed to 0%.

This is the kind of result that makes training feel empirical rather than deterministic. The same broad goal can produce very different behavior depending on dataset staging, token budget, sequence surface, and exact run recipe. The response should not be “the model is random.” The response should be “the run recipe is part of the system.”

Figure: Spec10 asset scene DSL output. Spec10 evidence: asset-backed poster-stack scene. This is where the model started emitting structure that a compiler could materialize into a richer SVG.

Spec11: Keyed Scene DSL

Spec11 introduced keyed component vocabulary. The model no longer had to carry as much literal payload in the scene line. It could point at structured keys. This matters because content and layout are different problems. A scene DSL should say what structure exists and where content is referenced; it should not always inline every payload string.

The results show the difference. Spec11 r1 was weak: 5.7% exact and 25.7% renderable. Spec11 r2 reached 100% exact and 100% renderable. The smoke run failed, but the canonical r2 proved that keyed references were a powerful interface.

Spec12: Gold Scene DSL And The Rung-20 Confusion

Spec12 is where the project became a real lab. It had twenty rungs, multiple reports, compact gold mappings, budget analysis, prompt-to-SVG reports, hidden probes, and tokenizer manifests. This is also where I need to be precise: spec12 r20 exists, but that is not spec20.

The spec12 compression analysis explains the major representational change. The early gold mappings were structurally correct but too verbose. They repeated field-level bindings that the compiler could infer. The compression pass moved repeated component payloads into object-level refs like [table_block:groups.0] and kept topology refs where topology mattered, such as decision-tree node and edge ids.

Spec12 results were not monotonic. R1 was 0% exact. R3 reached 42.4% exact and 90.9% renderable. R4 regressed to 30.3% exact and 45.5% renderable. R7 and r8 collapsed to 0%. R15 reached 61.1% exact with 43.8% hidden exact. R17 reached 100% exact, 100% renderable, and 100% hidden exact. R18 also hit 100% visible exact, while r20 reached 97.2%.

The spec12 tokenizer manifest for r17 is a good snapshot of the mature interface: 1080 tokenizer rows, 324 tag-seed rows, 100 reserved control tokens, byte-perfect roundtrip, and a model-facing DSL with tokens like [task:svg], [layout:memory_map], [OUT], [scene], and family-specific structural tags. That is not raw SVG anymore. It is a compiler-facing language.

Figure: Spec12 table matrix generated output. Spec12 evidence: gold scene DSL compiled into a table/matrix infographic.

Spec13a And Spec13b: Intent And Planning Are Harder Than Reproduction

Spec13a tried to bridge intent prompts into scene DSL. The best visible exact rates were decent, but hidden exactness stayed much lower. R2 had 75.0% visible exact and 41.7% hidden exact. That gap matters. It means the model could satisfy seen-style prompts better than it could generalize the intent bridge.

Spec13b introduced a decision-tree IR and scene bridge. The early r1-r3 runs often had 0% exact while staying highly renderable. R4 improved to 50.0% visible exact and 66.7% hidden exact. The tree-scene branch stayed lower. The lesson was that planning IR gave structure, but routing exactness still needed its own training pressure.

Spec14 And Spec15: Bounded Families Work

Spec14a focused on comparison boards. It took many rungs, but eventually reached strong performance: r9 and r10 reached 100% visible exact and 90% hidden exact, while the manifest-level best was 97.1% exact and 100% renderable. Spec14b focused on timelines and reached 100% visible and hidden exact by r3.

Spec15a focused on memory maps. It was noisy for many rungs, then r9 reached 100% visible and hidden exact. Spec15b focused on system diagrams and reached 100% visible and hidden exact by r2.

This is the strongest argument for family-specific DSLs. When the family is bounded, the compiler is deterministic, and the probe contract is clear, a small model can become highly reliable. But that does not automatically solve the next problem: choosing the correct family from broader intent.

Spec16: Shared Bundle Contract

Spec16 tried to put multiple solved families under a shared bundle language. The intended pipeline was:

Spec16 pipeline:
upstream request/router
  -> shared scene bundle
  -> family lowerer
  -> family DSL
  -> deterministic compiler
  -> SVG

The spec16 contract explicitly disallowed arbitrary SVG and topic-bearing one-off hacks. The model was supposed to emit a shared scene_bundle.v1 with family, form, style controls, and topology counts. That is a higher-level DSL than spec12.
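The layering above can be sketched in a few lines of Python. This is a toy illustration, not the repository's lowerer or compiler: the function names `lower_bundle` and `compile_family_dsl`, the one-line-per-stage DSL, and the box geometry are all assumptions made for the example. The point is only that everything after the bundle is deterministic code.

```python
from xml.etree import ElementTree as ET


def lower_bundle(bundle: dict) -> list[str]:
    """Lower a shared bundle into a toy family DSL (hypothetical format)."""
    if bundle.get("family") != "system_diagram":
        raise ValueError(f"no lowerer for family {bundle.get('family')!r}")
    # One DSL line per declared stage; the topology count drives the output.
    return [f"stage {i}" for i in range(int(bundle["stages"]))]


def compile_family_dsl(lines: list[str]) -> str:
    """Deterministically compile toy DSL lines into an SVG string."""
    width = 40 + 120 * len(lines)
    svg = ET.Element(
        "svg",
        xmlns="http://www.w3.org/2000/svg",
        width=str(width),
        height="100",
    )
    for i, _line in enumerate(lines):
        # Geometry is owned by the compiler, never by the model.
        ET.SubElement(svg, "rect", x=str(20 + 120 * i), y="20",
                      width="100", height="60")
    return ET.tostring(svg, encoding="unicode")


bundle = {"family": "system_diagram", "form": "build_path", "stages": 4}
svg_text = compile_family_dsl(lower_bundle(bundle))
print(svg_text)
```

Because both functions are pure, the same bundle always yields the same SVG, which is what makes exact-match scoring meaningful.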

Spec16 had useful rungs and dangerous rungs. R11 had 75.0% visible exact, 94.4% renderable, and 100% hidden exact in the extracted stage ledger. R12 collapsed to 0% exact and 43.8% renderable in the autopsy. The autopsy is critical: r12 used the same unique row sets as r11 with different ordering and about one-third of the processed training compute. That was enough to collapse all three families.

The spec16 r12 failure signature included mixed-family tag soup, duplicated [bundle], copied prompt/control markers like [OUT] and [task:svg], missing stop markers, dirty tail text, missing singleton fields, and wrong-family topology keys. That is not a simple syntax bug. It is a global output-contract collapse.
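Several of those failure modes can be detected with plain string checks. The sketch below is an assumption about how such a classifier might look, not the actual autopsy script: the function name and the failure labels are hypothetical, and only the markers quoted in this report (`[bundle]`, `[/bundle]`, `[OUT]`, `[task:svg]`, `[family:...]`) are used.

```python
import re

# Prompt/control markers that should never appear in the model's answer.
PROMPT_MARKERS = ("[OUT]", "[task:svg]")


def bundle_failure_signature(response: str, expected_family: str) -> list[str]:
    """Return hypothetical failure labels found in one generated response."""
    failures = []
    if response.count("[bundle]") > 1:
        failures.append("duplicated_bundle")
    if any(marker in response for marker in PROMPT_MARKERS):
        failures.append("copied_prompt_marker")
    # Split at the first stop marker; anything after it is tail text.
    head, stop, tail = response.partition("[/bundle]")
    if not stop:
        failures.append("missing_stop_marker")
    elif tail.strip():
        failures.append("dirty_tail")
    # More than one family, or the wrong one, counts as a family mismatch.
    families = set(re.findall(r"\[family:([a-z_]+)\]", head))
    if families != {expected_family}:
        failures.append("family_mismatch")
    return failures


clean = "[bundle] [family:timeline] [form:x] [/bundle]"
soup = "[bundle] [family:timeline] [bundle] [family:memory_map] [OUT] [stages:4]"
print(bundle_failure_signature(clean, "timeline"))
print(bundle_failure_signature(soup, "timeline"))
```

A response that trips several labels at once, like the second example, is the kind of output the autopsy describes as a global contract collapse rather than a single syntax bug.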

Spec17 And Spec18: Routing Failure Becomes Visible

Spec17 and spec18 tried to repair the bundle-routing problem, but the numbers stayed bad. Spec17 r1 through r4 had hidden exactness at 0%. The visible exactness rose only to 23.3% at r3, then fell to 13.3% at r4 while renderability improved. That tells me the model was learning to produce more parseable fragments without solving the routing contract.

Spec18 r1 did not solve it either: 6.7% visible exact, 53.3% renderable, and 0% hidden exact in the stage ledger. This was the point where a narrow local patch was not enough. The task needed a more structured curriculum mixture.

Spec19: Textbook Routing Mixture

Spec19 is the recovery line. It did not simply add more data. It changed the teaching mixture. The curriculum blueprint covered three families, nine intent profiles, nine topics, eight goals, three audiences, ten surfaces, eight competencies, and twelve failure frontiers. The audit passed with no missing declared surfaces.
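A coverage audit of this kind reduces to a set difference between what the blueprint declares and what the generated rows contain. The sketch below assumes a hypothetical row format (one dict per training row, keyed by axis name) and a hypothetical `audit_coverage` helper; the actual audit script is not reproduced here.

```python
def audit_coverage(blueprint: dict[str, set[str]],
                   rows: list[dict[str, str]]) -> dict[str, set[str]]:
    """Return, per blueprint axis, the declared values that no row covers."""
    missing = {}
    for axis, declared in blueprint.items():
        seen = {row[axis] for row in rows if axis in row}
        gap = declared - seen
        if gap:
            missing[axis] = gap
    return missing


# Toy blueprint with one axis; the family names are the three spec19 families.
blueprint = {"family": {"memory_map", "system_diagram", "timeline"}}
rows = [
    {"family": "memory_map"},
    {"family": "system_diagram"},
    {"family": "timeline"},
]
print(audit_coverage(blueprint, rows))       # empty dict: audit passes
print(audit_coverage(blueprint, rows[:2]))   # timeline is declared but missing
```

An empty result corresponds to "no missing declared surfaces"; any non-empty entry names exactly what the generator failed to produce.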

The rungs show the recovery:

| Spec19 Rung | Dataset Idea | Visible Exact | Hidden Exact | Renderability | Read |
| --- | --- | --- | --- | --- | --- |
| r1 | Initial spec19 bundle line | 6.7% | 0.0% | 56.7% | Still mostly broken. |
| r2 | Textbook routing mixture | 71.9% | 83.3% | 84.4% | The structured curriculum started working. |
| r3a | Probe-miss delta replay | 86.1% | 83.3% | 91.7% | Targeted replay helped. |
| r3b | Coherent replay | 81.2% | 83.3% | 93.8% | More coherent, but not strictly better on exactness. |
| r3c | Cumulative neighbor replay | 81.2% | 83.3% | 90.6% | Incremental additions did not guarantee improvement. |
| r3d | Balanced coverage replay | 84.4% | 91.7% | 93.8% | Best balanced pretrain/midtrain base. |
| r3e | Route recovery replay | 78.1% | 66.7% | 90.6% | A targeted repair regressed hidden exactness. |
| r3f | Cumulative balanced route recovery | 81.2% | 75.0% | 87.5% | More cumulative data did not beat r3d. |
| r4 | Unified curriculum | 59.4% | 66.7% | 84.4% | Unification was broader, but worse. |
| r3d SFT | Instruction SFT on r3d | 81.2% | 91.7% | 87.5% | Instruction SFT held hidden exactness at the r3d level but hurt visible exactness and renderability. |
| r3d SFT-B | Second instruction SFT variant | 87.5% | 91.7% | 93.8% | Best packaged spec19 result. |

This is where my current interpretation lands: a broader curriculum helped only when it was structured. Random broadening failed. Narrow repair also failed when it was attached to the wrong scaffold. The winning move was a structured mixture with enough breadth to teach the full route, and enough discipline to avoid turning the dataset into noise.

Figure: Spec19 memory map compiler smoke output. Spec19 r3d evidence: memory-map bundle routed into an allocator-region SVG.
Figure: Spec19 system diagram compiler smoke output. Spec19 r3d evidence: system-diagram bundle routed into a build-pipeline SVG.
Figure: Spec19 timeline compiler smoke output. Spec19 r3d evidence: timeline bundle routed into an IR-evolution timeline SVG.
Spec19 routebook contract example:
Prompt: [task:svg] [layout:memory_map] [form:arena_sections] ... [OUT]
Expected: [bundle] [family:memory_map] [form:arena_sections] ... [/bundle]
Generated: [bundle] [family:memory_map] [form:arena_sections] ... [/bundle]
Metrics: exact=true, materialized_exact=true, svg_exact=true, renderable=true
Failure still tracked: dirty tails, family misses, form misses, and special-token leakage.

After Spec19: The Gen1 Line

Spec19 was not the finish line. It was the point where the work changed naming systems. Instead of continuing as spec20, the later experiments used names that encode generation, representation, model size, context length, and rung directly. Examples:

  • gen1_scene_dsl_l3_d192_h384_ctx512_r0_bootstrap
  • gen1_full_scene_dsl_l3_d192_h384_ctx512_r2
  • gen1_full_scene_dsl_l3_d192_h384_ctx512_r3_recomb_holdout
  • gen1_full_scene_dsl_l3_d192_h384_ctx512_r4_concept_editorial
  • gen1_full_scene_dsl_l3_d192_h384_ctx512_r5_architecture_decision
  • gen1_full_scene_dsl_l3_d192_h384_ctx512_r6_balanced_all_assets_recomb
  • gen1_full_scene_dsl_l3_d192_h384_ctx512_r7_atomic_bindings
  • gen1_atomic_scene_dsl_l3_d192_h384_ctx2048_r1
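Because the names follow a regular pattern, they can be parsed back into fields. The regular expression below is inferred from the example names listed above, not taken from the repository, and `parse_run_name` is a hypothetical helper. The keys `l`, `d`, and `h` are left as written in the names; reading them as layer count, model width, and hidden size is my assumption.

```python
import re

# Pattern inferred from the run names above; not an official naming spec.
RUN_RE = re.compile(
    r"^gen(?P<gen>\d+)_(?P<representation>[a-z_]+?)"
    r"_l(?P<l>\d+)_d(?P<d>\d+)_h(?P<h>\d+)"
    r"_ctx(?P<ctx>\d+)_r(?P<rung>\d+)(?:_(?P<label>[a-z0-9_]+))?$"
)


def parse_run_name(name: str) -> dict:
    """Split a gen1-style run name into its encoded fields."""
    match = RUN_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized run name: {name!r}")
    # Convert purely numeric fields to int; leave text fields and None as-is.
    return {
        key: int(value) if value is not None and value.isdigit() else value
        for key, value in match.groupdict().items()
    }


parsed = parse_run_name("gen1_full_scene_dsl_l3_d192_h384_ctx512_r3_recomb_holdout")
print(parsed)
```

Parsing the name this way makes it possible to group a dashboard by context length or rung without opening each run's reports.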

The IR hub is important here because it shows the project becoming less like a single SVG experiment and more like a run-dashboard system. Each run has an ir_report.html, tokenizer artifacts, dataset QC, training pipeline files, parity checks, replay determinism, gradient accumulation regimens, backprop stitch runtime reports, training plans, weight manifests, and probe reports when available. That is a different maturity level from the early spec02/spec03 bootstrap work.

The fresh hub also makes the topology clearer. There are training runs, inference cards, report-only surfaces, and UI/demo entries. They should not all be interpreted as model-quality experiments. Some are there to prove the dashboard, some to prove runtime/inference paths, and some to preserve the training evidence. The important part is that the system started tracking the whole workflow, not only the final loss number.

| Post-Spec19 Branch | What It Represents | Why It Matters |
| --- | --- | --- |
| spec19 r1-r4 cluster | Route recovery, coherent replay, cumulative neighbors, balanced coverage, instruction SFT, and unified curriculum. | This is the repair ladder after spec17/spec18 exposed routing and scaffold failures. |
| spec_broader_1 and gen1_scene bootstrap | The bridge from the spec-numbered sequence into the generation/rung naming system. | This preserved the broader-scene failure signal while changing how runs were organized. |
| gen1_full_scene_dsl r2-r7 | Full scene DSL, recombination holdouts, concept/editorial assets, architecture-decision assets, all-asset balancing, and atomic-binding pressure. | This is the main post-spec19 training branch. It is where the work moved from recovery into broader generalization pressure. |
| gen1_atomic_scene_dsl ctx2048 | Larger context and atomic-binding tokenizer stress. | This exposed that context length is not enough if tokenizer/materialization roundtrip breaks. |
| python-ui-notebook-demo, reports, and inference cards | Dashboard, demo, reporting, and inference-surface artifacts. | These are not all training-quality claims. They show the surrounding lab infrastructure getting hardened. |

Gen1 r0: Broader Scene DSL Bootstrap

The gen1_scene_dsl_l3_d192_h384_ctx512_r0_bootstrap run looked very similar to the older broad-scene failure. It had 42 probe cases, 2.4% exact, 54.8% renderable, and 2.4% materialized exact. The tested prompt report showed the model often produced renderable fragments but then spilled into long tails, repeated prompts, copied <|eos|>/<|bos|>-style boundaries, and mixed unrelated scenes.

That run taught the same hard lesson in a new naming system: broad scene DSL is not automatically solved just because earlier bounded families worked. The broader target reopened tail control, recombination, and exactness failures.

Gen1 Full r2: Full Scene DSL Worked On The Probe

gen1_full_scene_dsl_l3_d192_h384_ctx512_r2 was a major jump. Its tokenizer roundtrip passed over 15,540 input lines and about 1.14M tokens. The probe report had 42 cases with 100% exact, 100% renderable, and 100% materialized exact. Its run ledger recorded a pretrain stage of 257,536 tokens and a midtrain stage of 1,252,864 tokens.

This result is why spec19 cannot be treated as the final state. The gen1 full line showed that, under a fuller staged dataset and cleaner run artifact system, the broader scene DSL could be solved on the visible probe contract.

Figure: Gen1 full scene DSL output. Gen1 full r2 evidence: broader full-scene DSL solved the visible probe before recombination and new asset families made the task harder again.

Gen1 Full r3-r5: Recombination And New Asset Pressure

Then the line deliberately made the problem harder. r3_recomb_holdout added recombination holdout pressure. The result dropped to 43.8% exact, 100% renderable, and 62.5% materialized exact across 32 cases. This is an important distinction: the model could still render every case, but recombination exposed that exact structure had not generalized cleanly.

r4_concept_editorial added concept/editorial asset pressure. It reached 40.6% exact, 100% renderable, and 62.5% materialized exact. r5_architecture_decision reached 43.8% exact, 100% renderable, and 62.5% materialized exact. These did not beat r2, but they were not simple failures either. They showed that adding new semantic/content families preserved renderability while stressing exact matching and materialization.

This is the same pattern again: the model can look healthy if I only inspect renderability. The moment I demand exact DSL reproduction across recombined or newly broadened assets, the harder weakness appears.

Gen1 Full r6: Balanced All-Assets Recombination

gen1_full_scene_dsl_l3_d192_h384_ctx512_r6_balanced_all_assets_recomb scaled the dataset much more aggressively. The tokenizer roundtrip passed over 219,240 input lines and about 15.4M tokens, with 34,272 tokenizer rows. The pretrain ledger processed about 4.19M tokens before the next stage began.

This run is important even before treating it as a final capability result. It shows the line moving from small curated experiments into larger balanced asset mixtures. This is the direction I wanted: more assets, more recombination, more coverage, but still under a deterministic DSL and report system.

Gen1 Full r7: Text Became Bindings Instead Of Literal Prose

One later improvement was not just visual. It was textual. Earlier scene DSL rows often carried the full visible text inside the training target: headline strings, subtitles, table rows, notes, callouts, and metric labels. That made the model learn layout and prose memorization at the same time. Toward the end, the dataset moved toward atomic bindings and placeholders. The model could emit structure like [bind] slots header headline [/bind] or [bind] groups 0 [/bind], while the compiler/content layer supplied the final text.

That is a major architectural change. It means the model is no longer being asked to memorize every infographic sentence. It is being asked to pick the right component, the right slot, the right binding path, and the right topology. This is closer to the Antsand/HMVC idea: the model chooses a deterministic structure, and the renderer fills the clean final artifact from a controlled data source.

Later atomic-binding DSL example:
[task:svg] [layout:table_matrix] [topic:ir_v66_edge_case_matrix] [theme:infra_dark] [tone:mixed] [density:balanced] [frame:card] [background:none] [OUT]
[scene] [canvas] wide [/canvas] [layout] table_matrix [/layout] [theme] infra_dark [/theme] [tone] mixed [/tone]
[component] [name] header_band [/name] [bind] header [/bind] [/component]
[component] [name] legend_block [/name] [bind] legend [/bind] [/component]
[component] [name] table_block [/name] [bind] groups 0 [/bind] [/component]
[component] [name] table_block [/name] [bind] groups 1 [/bind] [/component]
[component] [name] note_band [/name] [bind] footer [/bind] [/component]
[/scene]
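The example above can be turned into concrete content with a small amount of deterministic code. The following sketch is an assumption about how a content layer could resolve bindings: the regular expression, the function names, and the `content` dictionary are hypothetical and exist only to show the model output carrying paths while the text comes from elsewhere.

```python
import re

# Matches one [component] block with a name and a bind path.
COMPONENT_RE = re.compile(
    r"\[component\]\s*\[name\]\s*(.*?)\s*\[/name\]"
    r"\s*\[bind\]\s*(.*?)\s*\[/bind\]\s*\[/component\]"
)


def extract_bindings(scene: str) -> list[tuple[str, list]]:
    """Return (component_name, binding_path) pairs in emission order."""
    pairs = []
    for name, path in COMPONENT_RE.findall(scene):
        # "groups 0" becomes ["groups", 0]; numeric parts index into lists.
        keys = [int(part) if part.isdigit() else part for part in path.split()]
        pairs.append((name, keys))
    return pairs


def resolve(content, keys: list):
    """Walk a binding path through nested dicts and lists."""
    node = content
    for key in keys:
        node = node[key]
    return node


scene = """
[component] [name] header_band [/name] [bind] header [/bind] [/component]
[component] [name] table_block [/name] [bind] groups 0 [/bind] [/component]
[component] [name] table_block [/name] [bind] groups 1 [/bind] [/component]
"""
# Hypothetical content pack supplied by the compiler/content layer.
content = {"header": "Edge case matrix", "groups": ["Group A rows", "Group B rows"]}

bindings = extract_bindings(scene)
resolved = [(name, resolve(content, keys)) for name, keys in bindings]
print(resolved)
```

The model only has to get the component names and paths right; changing the wording of the infographic becomes an edit to `content`, not a retraining problem.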

Spec19 had already moved in this direction at the bundle level with prompts like [content_pack:default] and [content_pack:brief]. The output could be a compact bundle such as [family:memory_map], [form:layer_stack], [segments:6], and [cards:3]. That is different from training on final polished prose. It is training the routing and structural contract.

This also explains why the polished documentation infographics should not be presented as generated outputs. The stronger claim is that the project learned why literal text should be stripped or bound: full prose in the target creates memorization pressure; placeholder/binding targets make the compiler boundary cleaner and the model task more measurable.

Gen1 Atomic ctx2048: Longer Context And Atomic Binding Stress

The atomic line moved to ctx2048 with gen1_atomic_scene_dsl_l3_d192_h384_ctx2048_r1. That run had a huge tokenizer surface: 34,272 tokenizer rows and about 54.9M tokens in the tokenizer roundtrip attempt. But the tokenizer roundtrip failed: exact_match=false and byte match rate was only about 1.9%.

That failure is extremely useful. It says the next frontier was not only model capacity or context length. The data/tokenizer contract itself broke at the larger atomic-binding surface. A longer context window does not help if the tokenizer/materialization boundary is wrong. This is the same tokenizer lesson from spec03, but at a much larger and more painful scale.

What Made Me Proceed Further

The reason to continue past spec19 was that the failure mode became more precise. Spec19 recovered routing. Gen1 r2 showed full scene DSL could hit a perfect visible probe. Gen1 r3-r5 showed recombination and new asset classes broke exactness while preserving renderability. Gen1 r6 showed the pipeline could scale to much larger balanced asset mixtures. Gen1 atomic ctx2048 showed the next bottleneck was tokenizer/materialization correctness at the atomic binding layer.

That is the real progress. Not “the model is good now.” The progress is that the question became sharper:

  • Can the full scene DSL generalize to recombined holdouts?
  • Can new editorial and architecture assets be added without exactness collapse?
  • Can a larger balanced all-asset mixture preserve the r2 behavior?
  • Can atomic bindings be represented without breaking tokenizer roundtrip?
  • Does ctx2048 actually help once the tokenizer contract is repaired?

This also changes how I should describe the history. Spec19 is not the final recovery. Spec19 is the bridge from the spec-numbered curriculum into the gen1 run system. The newer line is trying to turn the solved pieces into a broader, inspectable, recombinable IR training workflow.

What The Model Generated And What The DSL Compiled Into

The useful thing about the probe reports is that they do not only store scores. They store the whole path from prompt to model text to parsed DSL to compiled SVG. For a probe case, the JSON records fields like:

  • prompt: the input given to the model.
  • expected_output: the target scene DSL.
  • raw_output and response_text: what the model actually generated.
  • parsed_output and parsed_output_tokens: the DSL portion extracted from the response.
  • tail_text: extra garbage after the first valid answer.
  • materialized_output and rendered_svg: what the DSL compiler produced.
  • render_error, valid_svg, renderable, exact_match, and materialized_exact_match: the actual gates.

That means the evidence is stronger than “the output looked good.” For each tested prompt, I can ask: did the model emit the right DSL, did the parser extract the right scene, did the compiler produce SVG, did the SVG render, and did the compiled output match the expected artifact?
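The headline percentages in this report are aggregates over those per-case gates. A minimal sketch of that aggregation, assuming each case is a dict using the field names listed above (the `gate_rates` helper and the four toy cases are made up for illustration):

```python
def gate_rates(cases: list[dict]) -> dict[str, float]:
    """Percentage of cases passing each gate."""
    gates = ("exact_match", "renderable", "materialized_exact_match")
    total = len(cases)
    return {
        gate: (100.0 * sum(bool(case.get(gate)) for case in cases) / total
               if total else 0.0)
        for gate in gates
    }


# Toy cases, not real probe data.
cases = [
    {"exact_match": True, "renderable": True,
     "materialized_exact_match": True, "tail_text": "<|eos|> leftover"},
    {"exact_match": False, "renderable": True,
     "materialized_exact_match": True, "tail_text": ""},
    {"exact_match": False, "renderable": True,
     "materialized_exact_match": False, "tail_text": ""},
    {"exact_match": False, "renderable": False,
     "materialized_exact_match": False, "tail_text": "garbage"},
]

rates = gate_rates(cases)
with_tail = sum(1 for case in cases if case.get("tail_text", "").strip())
print(rates, with_tail)
```

Keeping the gates separate is what lets a run be described as, for example, fully renderable but only partly exact. Note that the first toy case is exact even though it has tail text, because the parser scores the first valid answer.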

Successful Example: Gen1 Full r2

In gen1_full_scene_dsl_l3_d192_h384_ctx512_r2, the first train probe asked for a poster stack about the C Kernel Engine overview:

Prompt:
[task:svg] [layout:poster_stack] [topic:c_kernel_engine_overview] [theme:infra_dark] [tone:amber] [density:compact] [frame:card] [background:rings] [OUT]

The expected DSL was a compact scene contract:

Expected scene DSL:
[scene] [canvas:tall] [layout:poster_stack] [theme:infra_dark] [tone:amber] [frame:card] [density:compact] [inset:md] [gap:sm] [hero:center] [columns:1] [emphasis:top] [rail:accent] [background:rings] [connector:line] [topic:c_kernel_engine_overview] [header_band:broader_poster|c_kernel_engine_overview|compiler_first_seed] [section_card:overview|c_kernel_engine_overview|stacked_explainer_contract|variant=hero|accent=amber] [section_card:coverage|broader_assets|more_visible_surface|variant=metric|accent=green] [compare_bar:baseline|48_units|current_surface|accent=amber] [compare_bar:expanded|86_units|broader_contract|accent=green] [section_card:next_move|freeze_the_wider_dsl|then_scale_the_dataset|variant=note|accent=blue] [footer_note:c_kernel_engine_overview_poster_contract] [/scene]

The model generated the same parsed DSL prefix. The response still had a long tail after <|eos|>, but the parser extracted the first valid scene. The gates were all green: exact_match=true, materialized_exact_match=true, svg_exact_match=true, valid_svg=true, and renderable=true.

The compiler turned the DSL into a full SVG poster: 960 by 1420 canvas, dark infra background, radial rings, gradient hero band, header text, section cards, comparison bars, callout note, and footer. That is the right division of labor. The model emits the scene contract. The compiler owns geometry, gradients, SVG tags, text placement, shadows, fills, strokes, and final XML.

Failure Example: Gen1 r0 Bootstrap

The gen1_scene_dsl_l3_d192_h384_ctx512_r0_bootstrap run showed the opposite behavior. It often produced a valid-looking first scene, then spilled into repeated prompt/scene fragments. A typical response started with a plausible [scene] but then continued with repeated <|eos|><|bos|> blocks and unrelated comparison-chart scenes.

That explains why r0 had only 2.4% exact while still being 54.8% renderable. The compiler could often render some extracted fragment. But the generation contract was not stable. The model had not learned clean stopping, clean routing, or clean single-scene closure.

Failure Example: Atomic ctx2048

The gen1_atomic_scene_dsl_l3_d192_h384_ctx2048_r1 run exposed a different failure. The issue was not only generated output quality. The tokenizer roundtrip itself failed on the larger atomic binding surface. The roundtrip attempted about 54.9M tokens across 56,268 input lines, but reported exact_match=false and a byte-match rate around 1.9%.

That means this branch should not be trusted as a clean training result yet. Before asking whether the model can learn atomic bindings with a 2048-token context, the data/tokenizer/materialization contract has to be repaired. This is exactly why the pipeline needs gates before training.
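A roundtrip gate of this kind can be sketched as encode, decode, and compare. The code below uses toy byte-level tokenizers, not the project's tokenizer, and the definition of byte match rate (position-wise agreement divided by the longer length) is my assumption; the real report may compute it differently.

```python
def roundtrip_report(lines, encode, decode) -> dict:
    """Encode and decode every line, then compare against the original bytes."""
    tokens = matched = total = 0
    exact = True
    for line in lines:
        ids = encode(line)
        tokens += len(ids)
        source = line.encode("utf-8")
        restored = decode(ids).encode("utf-8")
        exact = exact and source == restored
        # Position-wise agreement, penalizing any length difference.
        total += max(len(source), len(restored))
        matched += sum(a == b for a, b in zip(source, restored))
    rate = matched / total if total else 1.0
    return {"exact_match": exact, "byte_match_rate": rate, "tokens": tokens}


def byte_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")


def lossy_encode(text: str) -> list[int]:
    # Deliberately broken tokenizer: silently drops every space byte.
    return [b for b in text.encode("utf-8") if b != 0x20]


lines = ["[bind] groups 0 [/bind]", "[bind] header [/bind]"]
good = roundtrip_report(lines, byte_encode, byte_decode)
bad = roundtrip_report(lines, lossy_encode, byte_decode)
print(good)
print(bad)
```

A preflight script could refuse to launch training unless `exact_match` is true, which is the gate this branch needed.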

Where The Full Per-Run Evidence Lives

I should not paste every full generated SVG into this post. The SVG strings are enormous, and many runs contain dozens of probe cases. But the evidence is already stored locally in a consistent pattern:

Per-run evidence files:
/home/antshiv/.cache/ck-engine-v7/models/train/<run>/*_probe_report.json
/home/antshiv/.cache/ck-engine-v7/models/train/<run>/*_tested_prompts_report.md
/home/antshiv/.cache/ck-engine-v7/models/train/<run>/*_compiler_smoke_report.json
/home/antshiv/.cache/ck-engine-v7/models/train/<run>/ir_report.html
/home/antshiv/.cache/ck-engine-v7/models/ir_hub.html

For the next revision of this research bundle, the right move is to generate a dedicated appendix page from those JSON reports: one row per run, one representative success case, one representative failure case, the parsed DSL, the compiled SVG preview, and the exact/renderable/materialized flags. That appendix should be generated, not hand-written. The blog can then explain the results while the appendix carries the full per-rung evidence.
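A generated appendix could start from something like the sketch below. It assumes each `*_probe_report.json` holds a top-level `cases` list whose entries carry `prompt` and `exact_match`; that layout is a guess for illustration, and `build_appendix_rows` is a hypothetical helper. The demo writes a fake report into a temporary directory rather than reading the real cache.

```python
import json
import tempfile
from pathlib import Path


def build_appendix_rows(train_root: Path) -> list[dict]:
    """One row per run: case count plus a representative success and failure."""
    rows = []
    for report_path in sorted(train_root.glob("*/*_probe_report.json")):
        report = json.loads(report_path.read_text(encoding="utf-8"))
        cases = report.get("cases", [])  # assumed key, see lead-in
        success = next((c for c in cases if c.get("exact_match")), None)
        failure = next((c for c in cases if not c.get("exact_match")), None)
        rows.append({
            "run": report_path.parent.name,
            "cases": len(cases),
            "success_prompt": success.get("prompt") if success else None,
            "failure_prompt": failure.get("prompt") if failure else None,
        })
    return rows


# Demo against a throwaway directory with one fake run.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "demo_run"
    run_dir.mkdir()
    fake_report = {"cases": [
        {"prompt": "p1", "exact_match": True},
        {"prompt": "p2", "exact_match": False},
    ]}
    (run_dir / "demo_probe_report.json").write_text(
        json.dumps(fake_report), encoding="utf-8")
    rows = build_appendix_rows(Path(tmp))

print(rows)
```

From rows like these, rendering an HTML or Markdown table is mechanical, which keeps the appendix in sync with the reports instead of with my memory.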

How The Dataset Pipeline Was Built

The training work was not hand-editing one text file. The C Kernel Engine v7 repository accumulated a small factory of scripts. That factory matters because the dataset is part of the model. If the generator is sloppy, the model learns slop. If the generator has a clear contract, the model has a chance to learn the contract.

Dataset Generators

The early generators created raw SVG and instruction-style rows:

  • generate_svg_instruction_dataset_v7.py generated instruction-to-SVG rows.
  • generate_svg_basics_curriculum_v7.py generated staged basics such as shapes, color, layout, cards, charts, and paths.
  • generate_svg_semantic_shapes_v7.py generated semantic toy-shape curricula.
  • generate_svg_toy_atoms_v7.py and generate_svg_structured_toy_v7.py generated the first controlled structured atom datasets.

Later generators became spec-specific:

  • generate_svg_structured_spec05_v7.py and generate_svg_structured_spec06_v7.py built the structured scene and infographic data.
  • generate_svg_structured_spec07_v7.py and generate_svg_structured_spec08_v7.py moved the task into scene DSL form.
  • generate_svg_structured_spec10_v7.py, spec11, and spec12 introduced asset-backed and keyed scene contracts.
  • generate_svg_structured_spec14a_v7.py, spec14b, spec15a, and spec15b generated family-specific DSLs.
  • generate_svg_structured_spec16_v7.py generated the shared scene-bundle contract.
  • generate_svg_structured_spec17_v7.py, spec18, and spec19 generated routing and repair curricula.

Materializers

The materializer scripts turned generated spec data into the actual train, midtrain, tokenizer, seen-prompt, hidden-prompt, render-catalog, and probe-contract artifacts consumed by the runtime:

  • materialize_spec04_structured_atoms_v7.py through materialize_spec16_scene_bundle_v7.py staged the main lineage.
  • materialize_spec17_scene_bundle_v7.py, materialize_spec18_scene_bundle_v7.py, and materialize_spec19_scene_bundle_v7.py staged the routing-repair line.
  • materialize_spec19_probe_miss_delta_v7.py, materialize_spec19_coherent_replay_union_v7.py, materialize_spec19_cumulative_neighbor_replay_v7.py, materialize_spec19_balanced_coverage_replay_v7.py, materialize_spec19_route_recovery_replay_v7.py, and materialize_spec19_unified_curriculum_v7.py created the spec19 replay variants.
  • materialize_spec19_sft_instruction_v7.py created the instruction-SFT branch on top of the stronger spec19 r3d base.

Renderers, Canonicalizers, And Lowerers

The renderer and canonicalizer scripts were as important as the training scripts. They defined what “correct” meant.

  • render_svg_structured_atoms_v7.py rendered structured atom outputs.
  • render_svg_structured_scene_v7.py and render_svg_structured_scene_rich_v7.py rendered early scene DSLs.
  • render_svg_structured_scene_spec12_v7.py, spec14a, spec14b, spec15a, spec15b, and spec16 rendered the later family and bundle contracts.
  • spec12_scene_canonicalizer_v7.py, spec14a_scene_canonicalizer_v7.py, spec14b_scene_canonicalizer_v7.py, spec15a_scene_canonicalizer_v7.py, spec15b_scene_canonicalizer_v7.py, and spec16_scene_bundle_canonicalizer_v7.py normalized model output before scoring.
  • spec16_bundle_lowering_v7.py lowered shared bundle descriptions into family-specific scene DSLs.

Preflight And Probe Scripts

The preflight scripts stopped me from training on broken assumptions. The probe scripts made the output measurable.

  • spec07_preflight_v7.py through spec19_preflight_v7.py checked dataset shape, tokenizer budgets, and staged artifact hygiene before training.
  • build_spec04_probe_contract_v7.py through build_spec19_probe_contract_v7.py defined visible and hidden prompt contracts.
  • build_probe_report_v7.py converted model generations into exact/renderable/materialized score reports.
  • build_probe_repair_eval_v7.py tested whether deterministic repair could save malformed outputs.
  • build_bundle_probe_autopsy_v7.py and the spec16/spec17/spec19 reports classified bundle-routing failure modes.

Training Launch Scripts

The shell launchers encoded the actual staged training choices: pretrain, midtrain, SFT, token budgets, context length, and run naming.

  • spec04_midtrain_sft_overnight_v7.sh, spec05_pretrain_midtrain_v7.sh, and spec06_pretrain_midtrain_v7.sh ran the early structured experiments.
  • spec07_pretrain_midtrain_v7.sh through spec16_pretrain_midtrain_v7.sh ran the DSL and bundle experiments.
  • spec19_probe_miss_delta_pretrain_midtrain_v7.sh, spec19_coherent_replay_pretrain_midtrain_v7.sh, spec19_balanced_coverage_pretrain_midtrain_v7.sh, spec19_route_recovery_pretrain_midtrain_v7.sh, and spec19_unified_curriculum_pretrain_midtrain_v7.sh ran the curriculum variants.
  • spec19_sft_instruction_on_r3d_v7.sh and spec19_sft_b_on_r3d_v7.sh ran instruction variants after the better r3d base existed.

This script layer is the reason I can write this report at all. Without it, the training history would be a pile of memories. With it, the history becomes a reproducible engineering trail.

What I Actually Trained

First, an important correction. I am not saying I trained Qwen3. I used Qwen-style transformer kernels and architecture ideas as a practical runtime reference inside C Kernel Engine. The SVG experiments used small transformer configurations that could run locally, produce reports, and expose the mechanics of training without needing frontier-scale compute.

The task was deliberately narrow: can a model learn to emit SVG-like or scene-DSL-like structures that can be parsed, materialized, rendered, evaluated, and compared across runs?

That sounds simple. It is not. Raw SVG is text, but it is not ordinary prose. It has strict syntax, nested structure, numeric coordinates, geometric relationships, style attributes, paths, layout decisions, and renderer expectations. A model can produce something that looks close to SVG but still fails because one tag is missing, one coordinate is malformed, one scene block is not closed, or one route family is confused.

This made SVG useful as a training laboratory. It was visual enough to be interesting, but structured enough that failures could be turned into measurable diagnostics.

The model-facing task changed over time. Early runs used SVG-like targets and bootstrap contracts. Later runs moved toward scene DSLs and visual bundle languages. The final SVG was increasingly treated as something a deterministic compiler should own, not something the model should hand-type from scratch.

Model-facing bundle shape:
[bundle]
[family:system_diagram]
[form:build_path]
[theme:signal_glow]
[tone:blue]
[density:balanced]
[background:mesh]
[stages:4]
[links:4]
[terminal:1]
[footer:1]
[/bundle]

That is closer to a circuit diagram for a visual system than to prose. The model chooses the family, form, routing, topology, and style controls. The compiler turns that into final SVG geometry, gradients, markers, wrapping, and renderer-safe XML.

Why I Used SVG Instead Of A Toy Text Task

A toy text task can make a model look successful too quickly. If the target is simple prose, the evaluation can become fuzzy: maybe the output is close enough, maybe the paragraph is acceptable, maybe the language sounds plausible. That is useful for product demos, but it is weak for learning training mechanics.

SVG is harder in the right way. It is text, but it has a renderer. It is structured, but it is still human-readable. It has syntax, hierarchy, style, geometry, and composition. It can fail at many layers:

  • The output can be invalid text for the expected DSL.
  • The output can be valid DSL but fail to compile into SVG.
  • The SVG can compile but fail XML parsing.
  • The SVG can parse but be visually or semantically wrong.
  • The output can be renderable but not exact.
  • The output can match the easy family while failing hidden routing cases.

That makes SVG a useful laboratory because it creates measurable failure surfaces. The model cannot hide behind plausible prose. Either the artifact satisfies the contract or it does not.
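Those layers can be checked in order, reporting the first one that fails. The sketch below is illustrative only: `first_failing_layer`, the layer labels, and both toy compilers are hypothetical, and XML well-formedness stands in for a real render check.

```python
from xml.etree import ElementTree as ET


def first_failing_layer(dsl_text: str, expected_dsl: str, compile_fn) -> str:
    """Walk the gates in order and name the first layer that fails."""
    text = dsl_text.strip()
    if not (text.startswith("[scene]") and text.endswith("[/scene]")):
        return "invalid_dsl"
    try:
        svg = compile_fn(text)
    except ValueError:
        return "compile_error"
    try:
        ET.fromstring(svg)  # well-formedness check only, not a real renderer
    except ET.ParseError:
        return "invalid_xml"
    if text != expected_dsl.strip():
        return "renderable_not_exact"
    return "pass"


def toy_compile(dsl: str) -> str:
    """Toy compiler: requires a layout tag, emits a fixed well-formed SVG."""
    if "[layout:" not in dsl:
        raise ValueError("scene has no layout")
    return '<svg xmlns="http://www.w3.org/2000/svg"><rect width="10" height="10"/></svg>'


def broken_compile(dsl: str) -> str:
    """Toy compiler with a bug: the rect element is never closed."""
    return "<svg><rect></svg>"


expected = "[scene] [layout:poster_stack] [tone:amber] [/scene]"
near_miss = "[scene] [layout:poster_stack] [tone:blue] [/scene]"

results = [
    first_failing_layer("plausible prose", expected, toy_compile),
    first_failing_layer("[scene] [/scene]", expected, toy_compile),
    first_failing_layer(expected, expected, broken_compile),
    first_failing_layer(near_miss, expected, toy_compile),
    first_failing_layer(expected, expected, toy_compile),
]
print(results)
```

The near-miss case is the interesting one: it compiles and parses, so a renderability-only metric would count it as healthy, while the exactness gate catches the wrong tone.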

What I Did Not Train

I did not train a general image model. I did not train Qwen3 itself. I did not train a frontier model. I used Qwen-style transformer kernels and small local C Kernel Engine configurations to study a narrow generation problem. That matters because the scale is intentionally small. The point was to understand the mechanics of training and evaluation, not to claim frontier capability.

I also did not treat the generated SVG as an end-user product by itself. The generated visuals are test outputs. They are evidence that the pipeline can connect prompt, DSL, compiler, render, and probe. The design quality can improve later. The first goal was to make the training loop inspectable.

The Artifact Chain

The important output of the work is not a single cherry-picked SVG sample. The important output is the artifact chain:

  1. Generate or collect SVG examples.
  2. Normalize them into a training representation.
  3. Create train and holdout splits.
  4. Build or update tokenizer material.
  5. Train a small model through the C Kernel Engine training path.
  6. Generate reports for the run.
  7. Probe the model against specific prompts and contracts.
  8. Classify failures.
  9. Decide whether the next step is data repair, tokenizer repair, capacity change, DSL redesign, or evaluation repair.
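For step 3, one simple way to keep splits stable across regenerated datasets is to assign each prompt to a bucket by hashing it. This is a generic technique offered as a sketch, not a description of how the repository's materializers build their seen and hidden prompt sets; `split_bucket` and the 10% default are assumptions.

```python
import hashlib


def split_bucket(prompt: str, holdout_percent: int = 10) -> str:
    """Deterministically assign a prompt to 'train' or 'holdout'."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    # Map the first digest byte (0..255) onto 0..99.
    slot = digest[0] * 100 // 256
    return "holdout" if slot < holdout_percent else "train"


prompts = [
    "[task:svg] [layout:poster_stack] [OUT]",
    "[task:svg] [layout:table_matrix] [OUT]",
    "[task:svg] [layout:memory_map] [OUT]",
]
assignment = {prompt: split_bucket(prompt) for prompt in prompts}
print(assignment)
```

Because the bucket depends only on the prompt text, a prompt never migrates between train and holdout when the generator is rerun, which protects hidden-probe numbers from silent leakage.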

That artifact chain matters because otherwise every training run becomes a vague memory. I trained something, it kind of worked, then I changed something else, and two weeks later I no longer know why the run improved or collapsed.

I have started packaging the lightweight evidence trail as a research bundle: probe summaries, run reports, curriculum audits, tested prompt reports, selected SVG outputs, and transcript context. I intentionally exclude model checkpoints and build folders from that bundle because they are large and not the first thing a reader needs. The first thing worth preserving is the process record.

Curated artifact bundle text
workspace/contentCreation/13-svg-training-research-report/artifacts/
  svg-training-research-bundle.zip

Included:
  probe summaries
  selected CKE v7 reports
  curriculum audits
  tested prompt reports
  selected SVG outputs
  video transcript context

Excluded:
  model checkpoints
  tokenizer binaries
  generated C build folders
  large cache artifacts

The committed C Kernel Engine documentation is part of the same evidence chain. The important pages are the method page, the results page, the curriculum page, the tokenizer page, and the v7 runbooks. These are not decorative docs. They are the public ledger for how the training loop is supposed to be operated.

What Each Artifact Is For

I want each run to leave behind enough artifacts that I can reconstruct the decision months later. The artifact types have different jobs:

| Artifact | Purpose | Question It Answers |
| --- | --- | --- |
| Train report | Records the runtime-level training and graph/report status. | Did the training run execute through the expected CKE path? |
| Eval contract | Defines the expected prompt/output behavior for a spec. | What did this run promise to satisfy? |
| Probe report | Runs prompts through the trained model and scores outputs. | Did the model satisfy the actual contract? |
| Tested prompts report | Shows prompt, expected output, response, and pass/fail state. | What did the model literally emit? |
| Curriculum audit | Checks generated training coverage against a declared blueprint. | Did the dataset actually cover the intended families and surfaces? |
| Probe autopsy | Classifies misses by failure type. | What should the next repair run target? |
| Sample SVGs | Stores compiler-smoke visual outputs. | Can the DSL/bundle be materialized into visible artifacts? |

Why This Matters For My Broader Workflow

This is not only about SVG. This is the same pattern I want across Antsand, C Kernel Engine, and eventually robotics hardware. The system should not be a pile of disconnected experiments. It should produce traces, reports, artifacts, and promotion gates.

For Antsand, the controlled language generates deterministic websites. For C Kernel Engine, the controlled language generates training and inference artifacts. For robotics, the future controlled language should generate schematics, KiCad files, Zephyr device-tree overlays, simulation files, and bring-up tests. The SVG training work is the AI-training version of that same philosophy.

Why I Call It Curriculum

In public AI discussion, I mostly hear the words “training” and “fine-tuning.” Those words are useful, but they are too coarse for what I was doing here.

Curriculum is the better word because the model was not simply being fed more data. The learning task changed in stages. The representation changed. The target contract changed. The probe changed. The data mixture changed. The model had to learn easier structures before harder structures, and when it failed, the repair rows had to target the failure instead of blindly adding more examples.

A curriculum is not just a dataset. It is an ordering and pressure system:

| Layer | Question | Failure Mode |
| --- | --- | --- |
| Representation | Should the model emit raw SVG or a smaller scene DSL? | Raw SVG becomes too brittle and token-expensive. |
| Tokenizer | Are important syntax boundaries represented cleanly? | The model leaks tokens, breaks tags, or mishandles delimiters. |
| Data mix | Which families are overrepresented or underrepresented? | One visual family learns while another collapses. |
| Capacity | Is the model too small, or is the representation bad? | Bigger layers hide bad data instead of fixing it. |
| Evaluation | Are we measuring the actual contract? | The loss improves but the output is unusable. |

This is why I think the word curriculum should be used more often. A model does not learn in a vacuum. It learns the pressure we put on it. If the pressure is badly shaped, the model can become good at the wrong thing.

Spec And Run Discipline

I used the language of specs and runs because I needed a way to separate a new experiment contract from a revision inside the same contract.

A spec means the learning problem changed in a material way: a new representation, a new DSL boundary, a new prompt surface, a new output family, or a new compiler/evaluation contract. A run revision means the main contract is still the same, but I am changing one pressure point: data mix, repair rows, token budget, epoch count, decoding hygiene, or balance.

This distinction matters. If a run fails, the wrong reaction is “start over” or “make the model bigger.” The better reaction is to classify the failure first.

The practical failure loop

Observe the weakness, classify the weakness, teach that weakness directly, rerun or repair inside the same spec if the contract is still valid, and only start a new spec when the old representation has reached a real measured ceiling.

The Tokenizer Problem Was Not Cosmetic

Tokenization became one of the clearest examples of why training is a systems problem. In ordinary text, a tokenizer can split words into subwords and still produce useful language. In a formal DSL, the wrong token boundary can destroy the experiment.

The committed method page records the important rule: reserve structure, not payload. If the tokenizer turns a whole component row into one atomic token, the model is no longer learning a compositional scene language. It is learning oversized one-off IDs.

Tokenizer Boundary Why It Fails Or Helps
[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=amber] This is too atomic. It hides the reusable structure inside one payload-heavy token.
[compare_bar] [field:label] [@compare_bar.0.label] [field:value] [@compare_bar.0.value] [/compare_bar] This is more compositional. The model can learn component type, field role, reference, and closure separately.

This is the kind of issue that loss alone will not explain cleanly. A run can look like it is learning while the tokenizer is forcing the wrong abstraction. In the SVG work, tokenizer design, DSL design, and compiler design had to move together.
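The "reserve structure, not payload" rule can be illustrated with a small pre-tokenizer. This is a sketch under assumptions, not the CKE tokenizer: the regex, the function names, and the convention that payload references start with `[@` are taken from the example rows above and are otherwise hypothetical.

```python
import re

# Bracketed units such as [compare_bar], [field:label], [/compare_bar].
BRACKET = re.compile(r"\[[^\[\]]*\]")


def structural_tokens(row: str) -> list[str]:
    """Split a DSL row into bracketed units, one token per unit."""
    return BRACKET.findall(row)


def reserved_surface(rows: list[str]) -> set[str]:
    """Collect tokens worth reserving: tags and field roles, not payload refs."""
    reserved = set()
    for row in rows:
        for tok in structural_tokens(row):
            if not tok.startswith("[@"):  # payload references stay compositional
                reserved.add(tok)
    return reserved
```

Run on the compositional row, this yields six tokens and reserves only four of them. Run on the atomic row, the whole component collapses into a single token, which is exactly the payload-as-token problem.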

Tokenizer Failure Classes

The tokenizer problems were not all the same. I started to think of them as separate failure classes:

Failure Class What It Looks Like Likely Fix
Structural split failure Bracketed tags break into awkward subpieces, so the model fights spelling instead of structure. Reserve small structural tokens during early formal-language runs.
Payload-as-token failure Whole component rows become atomic reserved tokens. Reserve structure only; leave payload references compositional.
Special-token leakage <|eos|>, <|bos|>, or prompt fragments appear after a valid output. Fix row boundaries, decode stopping, adapter extraction, and special-token supervision.
Stop-boundary confusion The model emits a good prefix but misses [/scene] or [/bundle]. Add clean closure rows and verify the decoder is not stripping the structural close token.
Token-budget distortion Long scenes truncate or the run processes fewer effective examples than expected. Measure packed token budgets from the actual tokenizer before launching serious runs.
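For the token-budget row in particular, the check is cheap to run before a launch. A minimal sketch, assuming a hypothetical `encode` callable that wraps whatever tokenizer the run will actually use; the report keys are illustrative names only.

```python
from typing import Callable, Iterable


def budget_report(
    rows: Iterable[str],
    encode: Callable[[str], list],  # hypothetical wrapper around the real tokenizer
    ctx_len: int,
) -> dict[str, int]:
    """Count tokens per row and flag rows that would not fit the context."""
    lengths = [len(encode(row)) for row in rows]
    return {
        "rows": len(lengths),
        "total_tokens": sum(lengths),
        "max_tokens": max(lengths, default=0),
        "rows_over_ctx": sum(n > ctx_len for n in lengths),
    }
```

A non-zero `rows_over_ctx` means some scenes will truncate regardless of how well the model learns, so it is worth knowing before reading the probe numbers.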

The Rule I Would Keep

Early in a formal DSL line, optimize for interpretability. Use explicit structural tokens so failures are easy to classify. Later, after the DSL stabilizes, revisit token efficiency with BPE, hybrid SVG-BPE, or a smaller reserved-token surface. Do not optimize for token efficiency before you know which contract you are trying to teach.

The DSL Was The Real Compression Layer

The scene DSL was not just a convenience. It was the compression boundary between model reasoning and deterministic rendering.

Raw SVG asks a small model to learn too many things at once: layout intent, geometry, XML syntax, color systems, path syntax, gradients, marker IDs, text wrapping, and renderer compatibility. A scene DSL lets the model learn the higher-level decision surface and leaves geometry to a compiler.

Structure/content/compiler split text
prompt
  -> model emits scene.dsl or bundle.dsl

content system
  -> supplies labels, numbers, copy, references

scene.dsl + content.json
  -> deterministic compiler
  -> compiled.svg
  -> render/probe report

This split is similar to how I think about Antsand. The model should not hand-edit a final website or hand-draw every coordinate. It should operate through a controlled language, and a deterministic system should generate the final polished artifact.
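The split can be made concrete with a toy compiler. This is an illustration only: `parse_bundle`, `compile_timeline`, the `{"labels": [...]}` content layout, and the fixed box geometry are all hypothetical and much simpler than a real scene compiler. The point it shows is that the model-facing string carries a count, while every coordinate is produced by deterministic code.

```python
import re
from xml.sax.saxutils import escape

FIELD = re.compile(r"\[(\w+):([^\[\]]*)\]")


def parse_bundle(dsl: str) -> dict[str, str]:
    """Read [key:value] fields from a closed [bundle] ... [/bundle] string."""
    if not (dsl.startswith("[bundle]") and dsl.rstrip().endswith("[/bundle]")):
        raise ValueError("bundle is not closed")
    return dict(FIELD.findall(dsl))


def compile_timeline(dsl: str, content: dict[str, list[str]]) -> str:
    """Toy compiler: the model picks the stage count, the compiler owns geometry."""
    fields = parse_bundle(dsl)
    stages = int(fields["stages"])
    labels = content["labels"][:stages]  # labels come from the content system
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="80">' % (stages * 120)]
    for i, label in enumerate(labels):
        x = 20 + i * 120  # coordinates come from the compiler, never the model
        parts.append('<rect x="%d" y="20" width="100" height="40" fill="none" stroke="black"/>' % x)
        parts.append('<text x="%d" y="45">%s</text>' % (x + 8, escape(label)))
    parts.append("</svg>")
    return "".join(parts)
```

Because label text is escaped by the compiler, a stray `&` in the content cannot break XML parsing, which is one of the brittle details the model never has to learn.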

What The Model Should Own

The model should own decisions that are semantic or structural:

  • Which family should this artifact use?
  • Which form inside the family matches the intent?
  • What style family, tone, density, and background are appropriate?
  • How many stages, segments, cards, links, or arrows are needed?
  • Which content roles should be present?

What The Compiler Should Own

The compiler should own decisions that are brittle, geometric, or renderer-specific:

  • Exact coordinates and spacing.
  • SVG XML boilerplate.
  • Gradient definitions, filters, marker IDs, and reusable defs.
  • Text wrapping and label placement.
  • Shape sizing, alignment, and collision avoidance.
  • Final validation and renderer compatibility.

This separation is important because it turns the model from a fragile SVG typist into a planner. That is the same direction I want for broader AI workflows: the model should choose intent through a constrained interface, while deterministic systems generate the final artifact.

What The Probe Reports Showed

The probe reports made the training story visible. They tracked whether outputs were exact, renderable, materialized correctly, and valid as SVG. That is a much better signal than staring only at loss.

A few examples from the cached run reports:

| Spec | Best observed run | Exact | Renderable | SVG exact | Interpretation |
| --- | --- | --- | --- | --- | --- |
| Spec 04 | structured_scenes | 0.0% | 100.0% | 0.0% | The output could render, but it did not match the intended contract. |
| Spec 06 | structured_infographics_r6 | 91.7% | 100.0% | 91.7% | A narrow structured task became learnable when the contract was clear. |
| Spec 10 | asset_scene_dsl_r4 | 94.3% | 100.0% | 97.1% | The richer DSL started to behave when representation and evaluation aligned. |
| Spec 12 | scene_dsl_r17 | 100.0% | 100.0% | 100.0% | Within a bounded contract, the model could satisfy the probe. |
| Spec 16 | scene_bundle_r11 | 81.2% | 95.8% | 83.3% | Bundled scene routing was harder than single-family scene generation. |
| Spec 17 | scene_bundle_r3 | 16.7% | 61.9% | 16.7% | The next curriculum step exposed routing and family-balance weakness. |
| Spec 19 | balanced_coverage / instruction variants | 88% range | 95% range | 88% range | Balanced coverage and curriculum repair recovered much of the contract. |

The numbers are not meant as a leaderboard. They are an engineering ledger. They show that some changes helped, some changes hurt, and some apparent wins were narrow. That is the point. A real training workflow has to preserve those facts.

Metric Definitions

The metrics have to be separated because they catch different kinds of failure:

| Metric | Meaning | Why It Matters |
| --- | --- | --- |
| Exact | The model output matches the expected DSL or bundle contract. | Measures contract fidelity. |
| Renderable | The output can be converted into a renderable artifact. | Measures structural viability, but not correctness. |
| Materialized exact | The compiled/materialized result matches the expected artifact-level target. | Separates model output correctness from compiler correctness. |
| SVG exact | The final SVG-level output matches the expected SVG-level target. | Useful when exact SVG output is a real requirement. |
| Budget truncation | The model was cut off by output/context budget. | Separates learning failure from decode-budget failure. |

Why Split Metrics Beat A Single Score

A single score hides the failure layer. Spec04 shows why. If I only looked at renderability, I could claim success. If I only looked at exactness, I might miss that the syntax path was improving. The split tells the truth: the run learned enough to produce renderable structure, but not enough to satisfy the intended contract.

Spec19 shows the opposite kind of nuance. Exactness recovered strongly, but the autopsy still found misses. That means the next step is not “do everything again.” The next step is targeted: family/form contrast, syntax cleanup, and special-token hygiene.
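Keeping the metrics apart is easy to do in code as well. A minimal sketch, assuming each probe case has already been scored into boolean flags; `ProbeCase` and its field names are hypothetical and are not the CKE report schema.

```python
from dataclasses import dataclass


@dataclass
class ProbeCase:
    exact: bool
    renderable: bool
    materialized_exact: bool
    truncated: bool = False


def summarize(cases: list[ProbeCase]) -> dict[str, float]:
    """Report each metric as its own percentage instead of one blended score."""
    n = len(cases)
    if n == 0:
        raise ValueError("no probe cases")

    def pct(flag: str) -> float:
        return round(100.0 * sum(getattr(c, flag) for c in cases) / n, 1)

    return {
        "exact": pct("exact"),
        "renderable": pct("renderable"),
        "materialized_exact": pct("materialized_exact"),
        "budget_truncation": pct("truncated"),
    }
```

A Spec04-shaped input, where every case renders but none is exact, produces 0.0 exact next to 100.0 renderable, which a single averaged score would blur into something meaningless.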

Spec19: What Recovery Actually Looked Like

Spec19 is useful because it shows a concrete recovery path after the spec17/spec18 collapse. The curriculum blueprint covered three families: memory maps, timelines, and system diagrams. It used nine intent profiles, ten prompt surfaces, and eight competencies. The goal was not to add random examples. The goal was to teach specific missing behaviors.

| Spec19 Surface | What It Taught |
| --- | --- |
| Explicit bundle anchor | Preserve clean bundle syntax and stop behavior. |
| Routebook direct | Route from topic, goal, and audience into family and form. |
| Form minimal pair | Distinguish sibling forms inside the same visual family. |
| Family minimal pair | Prevent one family from becoming the default attractor. |
| Style/topology bridge | Infer theme, tone, density, background, counts, and topology after routing is stable. |
| Routebook paraphrase | Handle wording shifts without changing the target bundle. |

The best packaged spec19 probe report had 44 cases. It hit 88.6% exact, 95.5% renderable, 88.6% materialized exact, and 0% budget truncation. The autopsy recorded five misses: one family miss, two form misses, and two syntax misses. That is actionable. It tells the next run where to apply pressure.
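The reported percentages follow directly from the case counts, which is a useful sanity check on any probe report. The renderable count of 42 below is inferred from the 95.5% figure rather than stated in the report; it is the only count out of 44 that rounds to that value.

```python
cases = 44
misses = {"family": 1, "form": 2, "syntax": 2}

exact = cases - sum(misses.values())          # 39 exact outputs
exact_pct = round(100 * exact / cases, 1)

# Inferred, not stated in the report: 42 of 44 is the only count that rounds to 95.5%.
renderable = 42
renderable_pct = round(100 * renderable / cases, 1)

print(exact, exact_pct, renderable_pct)       # 39 88.6 95.5
```

The two non-renderable cases line up numerically with the two syntax misses, although the autopsy excerpt here does not confirm that they are the same two cases.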

This is what I mean by a proper training ledger. The output is not “good” or “bad.” It is a map of exact failures.

The Spec19 Curriculum Shape

Spec19 had three visual families:

  • memory_map: layer stacks, typed regions, arena sections.
  • timeline: milestone chains and stage sequences.
  • system_diagram: linear pipelines, build paths, and selection paths.

It also had nine intent profiles and ten prompt surfaces. That matters because the model was no longer only responding to explicit tags. It had to route intent into a visual family and form.

Example Intent Pressure

A prompt like “pick one bundle only for audience operator topic arena_guard_flow goal route_debug” should become a memory-map style bundle, not a timeline or system diagram. That is not just syntax. That is routing.

Why Minimal Pairs Matter

Minimal pairs are useful because they force contrast. If two prompts differ only in a small routing clue, and the target family changes, the model has to learn the semantic boundary. Without that pressure, it can drift toward the most common family or the easiest form.

Loss Curves Versus Probe Reality

The clearest reason not to trust loss alone is visible in the cached ledgers. Several runs reached low minimum training loss while the probe contract stayed weak. That is not a contradiction. It means the model learned local token patterns or train-distribution fragments without solving the intended output contract.

Side note: Loss is not the product contract. The probe asks the real question: did the output parse, route, render, and materialize correctly?

| Run | First loss | Final loss | Min loss | Exact | Renderable | Materialized |
| --- | --- | --- | --- | --- | --- | --- |
| Spec04 structured scenes | 5.5746 | 0.3291 | 0.0058 | 0.0% | 100.0% | 0.0% |
| Spec06 infographics r1 | 6.4713 | 0.0230 | 0.0066 | 25.0% | 100.0% | 29.2% |
| Gen1 full r5 architecture | 6.4456 | 0.1172 | 0.0310 | 43.8% | 100.0% | 62.5% |
| Spec07 scene DSL r2 | 5.7409 | 0.3665 | 0.0407 | 38.9% | 75.0% | 44.4% |
| Gen1 full r4 editorial | 6.4224 | 0.1177 | 0.0603 | 40.6% | 100.0% | 62.5% |
| Gen1 full r3 recomb | 5.8303 | 1.0845 | 0.0796 | 43.8% | 100.0% | 62.5% |
| Gen1 scene r0 bootstrap | 5.6838 | 1.6463 | 0.0807 | 2.4% | 54.8% | 2.4% |
| Spec17 bundle r3 | 2.9948 | 1.5050 | 0.5231 | 16.7% | 61.9% | 16.7% |
| Spec19 r1 | 4.3121 | 0.8443 | 0.8443 | 4.8% | 66.7% | 4.8% |
| Spec18 bundle r1 | 3.0240 | 1.0410 | 1.0410 | 4.8% | 64.3% | 4.8% |

This is why every promoted run needs two gates: optimization health and contract health. A falling loss curve can tell me the runtime is learning something. It cannot tell me whether the output closes, routes to the right family, compiles, renders, or matches the materialized artifact.
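The two gates can be written as two independent checks whose results are only combined at the end. This is a sketch: the function name and all three thresholds are assumptions chosen for illustration, not values from the CKE runbooks.

```python
def promotion_gates(
    first_loss: float,
    min_loss: float,
    exact_pct: float,
    renderable_pct: float,
    max_loss_ratio: float = 0.1,       # assumed threshold
    min_exact_pct: float = 85.0,       # assumed threshold
    min_renderable_pct: float = 95.0,  # assumed threshold
) -> dict[str, bool]:
    """Evaluate optimization health and contract health separately."""
    optimization_ok = min_loss <= first_loss * max_loss_ratio
    contract_ok = exact_pct >= min_exact_pct and renderable_pct >= min_renderable_pct
    return {
        "optimization": optimization_ok,
        "contract": contract_ok,
        "promote": optimization_ok and contract_ok,
    }


# Spec04 figures from the ledger: loss fell from 5.5746 to 0.0058, exact stayed at 0.0%.
print(promotion_gates(5.5746, 0.0058, 0.0, 100.0))
```

With the Spec04 numbers the optimization gate passes and the contract gate fails, so the run is not promoted even though its loss curve looks excellent.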

Loss Going Down Is Not Enough

One of the most important lessons is that loss is only one lens. It is useful, but it is not sufficient.

A model can reduce loss by learning local token patterns while still failing the product contract. It can learn to emit plausible fragments without closing the scene. It can learn valid-looking SVG while failing exact materialization. It can learn the easy family and ignore the hard family. It can memorize without generalizing. It can generate something renderable but semantically wrong.

That is why the evaluation surface has to be designed with the dataset. If the task is DSL-to-SVG materialization, then the eval must know about parse validity, renderability, materialization, exactness, family routing, budget truncation, and closure. Generic loss cannot tell that whole story.

The Ablation Lessons

The ablation guide in the bundle is short, but it captures the practical ordering I now trust:

  1. Validate generalization before scaling model size.
  2. Run tokenizer sweeps at fixed small model and fixed data.
  3. Expand and clean data after choosing the best tokenizer surface.
  4. Scale model size only if underfitting remains.
  5. Start instruction tuning only after the base contract is stable.

That ordering matters because increasing model size can hide the wrong problem. If the tokenizer is bad, the dataset is leaky, or the DSL is too literal, a larger model may simply memorize more of the bad interface. It may improve a number without improving the system.

The minimum useful run set in the ablation note was also disciplined: compare ASCII, byte-BPE, and hybrid SVG-BPE tokenization first; then test expanded deduplicated data; then test augmented data. That is closer to engineering than guessing.

What An Ablation Should Protect Against

The danger in small-model training is that every change feels productive. Add rows. Change tokenizer. Change context length. Change model width. Change decoder. Change prompt surface. Change compiler. Then the run improves or collapses and you do not know why.

A real ablation protects against that by changing one primary axis at a time:

| Ablation Axis | Hold Fixed | Question |
| --- | --- | --- |
| Tokenizer | Dataset, model size, train budget, probe prompts. | Is the token surface helping or hurting the grammar? |
| Data mix | Tokenizer, model size, compiler, probe prompts. | Which families or failure classes need more pressure? |
| Context length | Dataset, tokenizer, model size, compiler. | Are failures caused by truncation or by learning? |
| Model capacity | Representation, tokenizer, data mix, eval. | Is the model underfitting after the contract is clean? |
| Instruction tuning | Base contract and evaluation gates. | Does controllability improve after the base behavior is stable? |

That is why I do not want to scale first. Scaling too early can make a broken representation look better without actually making the system stronger.
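The one-axis rule is simple enough to enforce before a run starts. A minimal sketch, assuming run configurations are plain dictionaries; the key names and values in the usage example are illustrative, not the CKE config format.

```python
def changed_axes(baseline: dict, candidate: dict) -> list[str]:
    """List every config key whose value differs between two runs."""
    keys = sorted(set(baseline) | set(candidate))
    return [k for k in keys if baseline.get(k) != candidate.get(k)]


def assert_single_axis(baseline: dict, candidate: dict) -> str:
    """Return the single changed axis, or refuse the comparison."""
    changed = changed_axes(baseline, candidate)
    if len(changed) != 1:
        raise ValueError(f"expected exactly one changed axis, got {changed}")
    return changed[0]


base = {"tokenizer": "ascii", "layers": 3, "ctx": 512, "dataset": "v1"}
sweep = dict(base, tokenizer="byte_bpe")
print(assert_single_axis(base, sweep))  # tokenizer
```

If a candidate run changes both the tokenizer and the context length, the check raises instead of letting the comparison into the ledger.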

Empirical Does Not Mean Random

When I talk about deciding whether to add layers, increase embedding size, change context length, add data, or improve the tokenizer, I do not think that process is fully deterministic. It is empirical.

But empirical does not mean random. This is an important distinction.

The deterministic side is real: the forward pass, backward pass, kernel math, generated C, memory plan, tokenizer files, dataset rows, and probe code are concrete. If those are wrong, the run is wrong.

The empirical side is also real: which data mixture teaches the target best, which representation gives the model the right compression surface, whether capacity is the bottleneck, whether the tokenizer is helping or harming, and whether the eval is measuring what I actually want.

Frontier labs probably have very mature internal versions of this loop. They may not have a complete mechanistic theory of every internal circuit in a trained model, but they almost certainly understand the operational process deeply: data mixture, architecture, objective, tokenizer, scale, evals, post-training, failure analysis, and promotion gates.

That is the level I am trying to build intuition for, just at my own scale.

Why Data And Algorithm Have To Move Together

The data and the algorithm cannot be separated cleanly. If the model architecture is too small, the data may look impossible. If the data representation is bad, a larger model may only memorize the bad structure. If the tokenizer is wrong, the model fights syntax instead of learning geometry. If the eval is weak, the run can look successful while the artifact is useless.

This is where curriculum became practical. I could ask:

  • Should this be raw SVG, or should the model emit a scene DSL that deterministic code compiles into SVG?
  • Should text content be generated by the model, or should it be bound later by a content layer?
  • Should I repair circles, cards, charts, layouts, paths, or route selection?
  • Should this be a narrow repair run, or does the representation itself need a new spec?
  • Should I scale model capacity now, or would that just hide a weak contract?

Those are not abstract questions. They decide what the next training run should do.

The Practical Decision Tree

After a run, I want the next-step decision to be almost mechanical:

  1. If the output is invalid syntax, inspect tokenizer boundaries and closure rows first.
  2. If the output is valid but routes to the wrong family, add family contrast and routebook pressure.
  3. If the output is the right family but the wrong form, add sibling-form minimal pairs.
  4. If the output has a correct prefix but a dirty tail, fix stop handling and special-token leakage.
  5. If the output truncates, fix decode budget before blaming learning.
  6. If all of that is clean and the model still fails broadly, then consider capacity.

That is not full determinism. But it is disciplined enough to avoid random wandering.
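The decision list above maps naturally onto a small classifier. This is a sketch with hypothetical field names. It makes one deliberate change to the list order, which is an assumption of the sketch: truncation is checked first, because a truncated output would otherwise be misread as a syntax failure.

```python
from dataclasses import dataclass


@dataclass
class ProbeOutcome:
    truncated: bool
    valid_syntax: bool
    family_ok: bool
    form_ok: bool
    clean_tail: bool
    exact: bool


def next_repair(o: ProbeOutcome) -> str:
    """Map one classified probe outcome to the next repair target."""
    if o.truncated:
        return "fix decode budget"
    if not o.valid_syntax:
        return "inspect tokenizer boundaries and closure rows"
    if not o.family_ok:
        return "add family contrast and routebook pressure"
    if not o.form_ok:
        return "add sibling-form minimal pairs"
    if not o.clean_tail:
        return "fix stop handling and special-token leakage"
    if not o.exact:
        # Only meaningful when this pattern is broad across cases, not for one miss.
        return "consider capacity"
    return "no repair needed"
```

Counting the returned labels across a whole probe report gives a rough picture of where the next run should apply pressure.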

Why SVG Became A Good Test Bed

SVG has a useful property: it is both text and structure. That makes it frustrating, but it also makes it measurable.

A generated paragraph can be fuzzy. A generated SVG either parses or it does not. A scene DSL either closes its tags or it does not. A materialized image either matches the expected route or it does not. That gives the training loop a feedback surface.

At the same time, SVG is not trivial. If the goal is more than a circle and a rectangle, the model has to learn composition: layout, hierarchy, grouping, style, repetition, color, icon families, and sometimes visual intent. That is why the task quickly moved from raw SVG toward scene representations and compiler-facing structures.

The Actual Generated Visual Families

The bundle includes sample compiler-smoke SVG outputs from three families: memory maps, system diagrams, and timelines. These are not proof of artistic quality. They are proof that the model-facing contract can be materialized through deterministic code into visible artifacts.

| Family | Example Outputs In The Bundle | What It Tested |
| --- | --- | --- |
| Memory map | memory_map_*_allocator_regions.svg | Regions, brackets, cards, topology counts, and allocator-style visual organization. |
| System diagram | system_diagram_*_build_pipeline_flow.svg | Stages, links, terminals, footer handling, and pipeline-style visual routing. |
| Timeline | timeline_*_ir_evolution_timeline.svg | Milestones, stage sequences, arrows, and temporal composition. |

I think of these as training circuits. A circuit is not useful because it is a screenshot. It is useful because every node has a role, every edge has a contract, and a failure can be traced. These SVG families gave me that kind of test bench for training.

Concrete Examples From The Runs

There is no spec01 artifact in the current bundle, so I am starting from the first packaged evidence: spec02/spec03 bootstrap and then the later probe reports. The useful progression is not one clean animation from bad to good. It is a sequence of contract changes.

| Stage | Example Behavior | Adjustment | Lesson |
| --- | --- | --- | --- |
| Spec02 / Spec03 | Bootstrap SVG training reports existed, but the important result was proving the CKE v7 train/report path. | Preserve train reports, eval contracts, and run folders instead of relying on memory. | The first output of training is not a pretty sample. It is a traceable experiment. |
| Spec04 | Outputs could render, but exactness was still 0%. | Separate renderability from exact contract fidelity. | A visual can appear valid while still failing the target. |
| Spec10 | The model often copied the scene DSL almost perfectly but missed the closing [/scene]. | Add closure discipline, stop-token hygiene, and stricter probe accounting. | One missing structural token can turn a good-looking run into a failed run. |
| Spec11 / Spec12 | Bounded DSL contracts reached 100% probe exactness. | Keep the DSL narrow and compiler-backed while proving the grammar. | Small models can satisfy formal contracts when the target is shaped correctly. |
| Spec17 / Spec18 | Routing into family, form, style, and topology collapsed. | Split routing from style/topology inference and add balanced contrast rows. | Widening the task too fast exposes real curriculum limits. |
| Spec19 | Most bundle outputs became exact/renderable again, but misses still showed family/form confusion and special-token leakage. | Use autopsies, minimal pairs, routebook prompts, and clean stop anchors. | The next run should repair specific failure classes, not blindly add data. |

Spec10 Failure Example: Almost Correct But Not Closed

One spec10 probe showed the model copying a long scene correctly but omitting the final close token. The expected target ended with:

Expected closure text
[footer_note:C_Kernel_Engine_CPU_first_LLM_inference] [/scene]

The response ended with:

Generated response text
[footer_note:C_Kernel_Engine_CPU_first_LLM_inference]

That is a small-looking miss, but it changes the contract. The model did not need more artistic skill. It needed stronger structural closure pressure and cleaner stop handling.
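A closure miss like this is cheap to detect mechanically. A minimal sketch, assuming DSL tokens are whitespace-separated as in the examples above; the function name is hypothetical, and the leading `[scene]` in the usage lines is added for illustration because the excerpts show only the tail of the output.

```python
def missing_close(output: str, tag: str = "scene") -> bool:
    """True when the output opens [tag] but never emits the matching [/tag]."""
    tokens = output.split()
    return f"[{tag}]" in tokens and f"[/{tag}]" not in tokens


expected = "[scene] [footer_note:C_Kernel_Engine_CPU_first_LLM_inference] [/scene]"
response = "[scene] [footer_note:C_Kernel_Engine_CPU_first_LLM_inference]"
print(missing_close(expected), missing_close(response))  # False True
```

Tracking this flag separately from exactness makes it clear when a run is one structural token away from passing.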

Spec19 Failure Example: Correct Shape, Dirty Tail

In spec19, the bundle contract became much cleaner, but some responses still leaked special tokens and additional prompt fragments after a correct bundle. A representative response started correctly:

Spec19 generated prefix text
[bundle] [family:system_diagram] [form:build_path] [theme:signal_glow] [tone:blue] [density:balanced] [background:mesh] [stages:4] [links:4] [terminal:1] [footer:1] [/bundle]

But the raw decoded tail could continue with <|eos|>, <|bos|>, and copied prompt fragments. That taught a different lesson: exact parsing may be recoverable if the adapter stops at the first valid closed bundle, but the training line still needs cleaner tokenizer and decoding hygiene.
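The adapter behavior described here, stopping at the first valid closed bundle, can be sketched in a few lines. This is an illustration under assumptions and not the actual CKE adapter: the function names are hypothetical, and the special-token list contains only the two tokens mentioned above.

```python
import re
from typing import Optional

CLOSED_BUNDLE = re.compile(r"\[bundle\].*?\[/bundle\]", re.DOTALL)
SPECIAL_TOKENS = ("<|eos|>", "<|bos|>")


def split_first_bundle(raw: str) -> tuple[Optional[str], str]:
    """Return (first closed bundle or None, the text that followed it)."""
    match = CLOSED_BUNDLE.search(raw)
    if match is None:
        return None, raw
    return match.group(0), raw[match.end():]


def leaked_specials(tail: str) -> list[str]:
    """List the special tokens that appear after the closed bundle."""
    return [tok for tok in SPECIAL_TOKENS if tok in tail]
```

Reporting the leaked tail alongside the extracted bundle keeps the two facts separate: the parse was recoverable, and the decode was still dirty.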

Prompt To Generated Output: Concrete Failure And Recovery Cases

The most useful artifacts are not the pretty SVGs. They are the prompt/expected/generated triples. These show exactly where the contract broke: closure, routing, family/form selection, recombination, or tail hygiene.

Side note: The bad outputs are the lab notes. A wrong bundle is more informative than a pretty gallery image because it shows what the next dataset must teach.

Spec10 closure failure

Spec10 closure failure: prompt / expected / generated
Prompt:
[task:svg] [layout:comparison_span_chart] [topic:performance_balance] [theme:infra_dark] [tone:green] [density:compact] [OUT]

Expected:
[scene] [canvas:wide] [layout:comparison_span_chart] [theme:infra_dark] [tone:green] [frame:none] [density:compact] [inset:md] [gap:sm] [hero:center] [columns:3] [emphasis:center] [rail:none] [background:rings] [connector:bracket] [topic:performance_balance] [header_band:headline|The_Performance_Gap|Same_floor_very_different_spans] [compare_bar:GPU_Compute|67000_GB_s_eq|5000x_total_span|accent=amber|note=HBM_is_fast_but_capacity_bound] [compare_bar:CPU_Compute|1800_GB_s_eq|144x_total_span|accent=green|note=Fit_and_cost_are_closer_to_deployment_reality] [axis:Log_scale_height|Same_ethernet_floor] [legend_row:amber=GPU_cluster|green=CPU_server] [annotation:Bottleneck_Shift|Memory_and_network_d ...

Parsed/generated:
[scene] [canvas:wide] [layout:comparison_span_chart] [density:compact] [tone:green] [frame:none] [density:compact] [inset:md] [gap:sm] [hero:center] [columns:3] [emphasis:center] [rail:none] [background:rings] [connector:bracket] [topic:performance_balance] [header_band:headline|The_Performance_Gap|Same_floor_very_different_spans] [compare_bar:GPU_Compute|67000_GB_s_eq|5000x_total_span|accent=amber|note=HBM_is_fast_but_capacity_bound] [compare_bar:CPU_Compute|1800_GB_s_eq|144x_total_span|accent=green|note=Fit_and_cost_are_closer_to_deployment_reality] [axis:Log_scale_height|Same_ethernet_floor] [legend_row:amber=GPU_cluster|green=CPU_server] [annotation:Bottleneck_Shift|Memory_and_network_dominate|accent=amber] [divider:dash] [span_bracket:GPU_span|5360x] [span_bracket:CPU_span|144x] [floor_band:Ethernet_same_floor_for_every_cluster] [thesis_box:GPU_structural_mismatch|CPU_gap_is_closeab ...

Flags: exact=False renderable=True materialized_exact=True missing_stop=None tail=<|eos|><|bos|>[task:svg] [layout:dual_panel_compare] [topic:ethernet_equalizer] [theme:paper_editorial] [tone:mixed] [density:compact] [edit:density=balanced] [OUT] [scene] [canvas:wide] [layout:dual_panel_compare] [theme:paper_editorial] [tone:mixed] [frame:p ...

Spec17 routing failure

Spec17 routing failure: prompt / expected / generated
Prompt:
[task:svg] [layout:memory_map] [form:layer_stack] [theme:signal_glow] [tone:green] [density:balanced] [background:mesh] [segments:6] [brackets:1] [cards:3] [OUT]

Expected:
[bundle] [family:memory_map] [form:layer_stack] [theme:signal_glow] [tone:green] [density:balanced] [background:mesh] [segments:6] [brackets:1] [cards:3] [/bundle]

Parsed/generated:
[bundle] [family:memory_map] [form:layer_stack] [theme:signal_glow] [tone:green] [density:balanced] [background:none] [segments:6] [brackets:1] [cards:3] [/bundle]

Flags: exact=False renderable=True materialized_exact=False missing_stop=False tail=<|eos|><|bos|>[task:svg] [layout:memory_map] [form:typed_regions] [theme:paper_editorial] [tone:blue] [density:balanced] [background:mesh] [segments:5] [brackets:0] [cards:3] [OUT] [bundle] [family:memory_map] [form:typed_regions] [theme:signal_glow] [tone:blu ...
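For flat bundles like this one, a field-level diff pinpoints the miss faster than reading the two strings side by side. A minimal sketch with hypothetical names; it keeps only one value per key, so it suits bundles and not scene DSL rows that repeat a component such as compare_bar.

```python
import re

FIELD = re.compile(r"\[(\w+):([^\[\]]*)\]")


def field_diff(expected: str, generated: str) -> dict[str, tuple]:
    """Return {field: (expected_value, generated_value)} for mismatched fields."""
    exp = dict(FIELD.findall(expected))
    gen = dict(FIELD.findall(generated))
    return {
        key: (exp.get(key), gen.get(key))
        for key in sorted(set(exp) | set(gen))
        if exp.get(key) != gen.get(key)
    }


expected = "[bundle] [family:memory_map] [form:layer_stack] [theme:signal_glow] [tone:green] [density:balanced] [background:mesh] [segments:6] [brackets:1] [cards:3] [/bundle]"
generated = "[bundle] [family:memory_map] [form:layer_stack] [theme:signal_glow] [tone:green] [density:balanced] [background:none] [segments:6] [brackets:1] [cards:3] [/bundle]"
print(field_diff(expected, generated))  # {'background': ('mesh', 'none')}
```

On the Spec17 pair above, the diff reduces the failure to a single field: the model emitted background none where mesh was expected.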

Spec18 still weak

Spec18 still weak: prompt / expected / generated
Prompt:
[task:svg] [layout:memory_map] [form:arena_sections] [theme:paper_editorial] [tone:amber] [density:airy] [background:grid] [segments:6] [brackets:0] [cards:3] [OUT]

Expected:
[bundle] [family:memory_map] [form:arena_sections] [theme:paper_editorial] [tone:amber] [density:airy] [background:grid] [segments:6] [brackets:0] [cards:3] [/bundle]

Parsed/generated:
[bundle] [family:memory_map] [form:arena_sections] [theme:paper_editorial] [tone:blue] [density:airy] [background:grid] [segments:6] [brackets:0] [cards:3] [/bundle]

Flags: exact=False renderable=True materialized_exact=False missing_stop=False tail=<|eos|><|bos|>[task:svg] [layout:memory_map] [form:layer_stack] [tone:blue] [density:airy] [background:grid] [background:none] [segments:6] [brackets:1] [cards:3] [OUT] [bundle] [family:memory_map] [form:layer_stack] [theme:paper_editorial] [tone:blue] [densit ...

Spec19 recovered but dirty tail

Spec19 recovered but dirty tail: prompt / expected / generated
Prompt:
[task:svg] [layout:memory_map] [form:layer_stack] [theme:signal_glow] [tone:green] [density:balanced] [background:mesh] [segments:6] [brackets:1] [cards:3] [OUT]

Expected:
[bundle] [family:memory_map] [form:layer_stack] [theme:signal_glow] [tone:green] [density:balanced] [background:mesh] [segments:6] [brackets:1] [cards:3] [/bundle]

Parsed/generated:
[bundle] [family:memory_map] [form:typed_regions] [theme:infra_dark] [tone:blue] [density:balanced] [background:none] [segments:5] [brackets:0] [cards:3] [/bundle]

Flags: exact=False renderable=True materialized_exact=False missing_stop=False tail=<|eos|><|bos|>pexatlan exctly for mpathatond ft. [family:memory_map] [form:typed_regions] [form:build_path]

Gen1 recombination miss

Gen1 recombination miss: prompt / expected / generated text
Prompt:
[task:svg] [layout:poster_stack] [topic:activation_memory_infographic] [theme:infra_dark] [tone:amber] [density:compact] [frame:card] [background:rings] [OUT]

Expected:
[scene] [canvas:tall] [layout:poster_stack] [theme:infra_dark] [tone:amber] [frame:card] [density:compact] [inset:md] [gap:sm] [hero:center] [columns:1] [emphasis:top] [rail:accent] [background:rings] [connector:line] [topic:activation_memory_infographic] [header_band:broader_poster|activation_memory_infographic|compiler_first_seed] [section_card:overview|activation_memory_infographic|stacked_explainer_contract|variant=hero|accent=amber] [section_card:coverage|broader_assets|more_visible_surface|variant=metric|accent=green] [compare_bar:baseline|48_units|current_surface|accent=amber] [compare_bar:expanded|86_units|broader_contract|accent=green] [section_card:next_move|freeze_the_wider_dsl|th ...

Parsed/generated:
[scene] [canvas:tall] [layout:poster_stack] [theme:infra_dark] [tone:amber] [frame:card] [density:compact] [inset:md] [gap:sm] [hero:center] [columns:1] [emphasis:top] [rail:accent] [background:rings] [connector:line] [topic:commodity_hardware_pattern] [header_band:broader_poster|commodity_hardware_pattern|compiler_first_seed] [section_card:overview|commodity_hardware_pattern|stacked_explainer_contract|variant=hero|accent=amber] [section_card:coverage|broader_assets|more_visible_surface|variant=metric|accent=green] [compare_bar:baseline|48_units|current_surface|accent=amber] [compare_bar:expanded|86_units|broader_contract|accent=green] [section_card:next_move|freeze_the_wider_dsl|then_scale_the_dataset|variant=note|accent=blue] [footer_note:commodity_hardware_pattern_poster_contract] [/scene]

Flags: exact=False renderable=True materialized_exact=False missing_stop=False tail=<|eos|><|bos|>[task:svg] [layout:comparison_span_chart] [topic:live_balance] [theme:paper_editorial] [tone:blue] [density:compact] [frame:card] [background:grid] [OUT] [scene] [canvas:wide] [layout:comparison_span_chart] [theme:paper_editorial] [tone:blue] [fr ...
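The probe cases above are easier to read as a field-by-field diff than as two long strings. The following is a minimal sketch, assuming the flat `[key:value]` bundle form shown in the spec18 and spec19 cases. The helper names `parse_fields` and `diff_fields` are hypothetical and are not part of the probe tooling.

```python
import re

# Matches [key:value] tokens; bare markers such as [bundle] or [/bundle] are skipped.
FIELD_RE = re.compile(r"\[([a-z_]+):([^\]]*)\]")


def parse_fields(text):
    """Return {key: value} for every [key:value] token in a flat bundle string."""
    fields = {}
    for key, value in FIELD_RE.findall(text):
        # Keep the first occurrence so a repeated key later in the text cannot overwrite it.
        fields.setdefault(key, value)
    return fields


def diff_fields(expected, generated):
    """Return {key: (expected_value, generated_value)} for every field that differs."""
    exp, gen = parse_fields(expected), parse_fields(generated)
    return {
        key: (exp.get(key), gen.get(key))
        for key in sorted(set(exp) | set(gen))
        if exp.get(key) != gen.get(key)
    }


# The "Spec18 still weak" case from this report.
expected = (
    "[bundle] [family:memory_map] [form:arena_sections] [theme:paper_editorial] "
    "[tone:amber] [density:airy] [background:grid] [segments:6] [brackets:0] "
    "[cards:3] [/bundle]"
)
generated = (
    "[bundle] [family:memory_map] [form:arena_sections] [theme:paper_editorial] "
    "[tone:blue] [density:airy] [background:grid] [segments:6] [brackets:0] "
    "[cards:3] [/bundle]"
)

print(diff_fields(expected, generated))  # {'tone': ('amber', 'blue')}
```

Read this way, the spec18 miss is a single-field error (tone), while the spec19 case above differs in several fields at once. The sketch only handles flat bundles; the gen1 scene DSL repeats keys such as `section_card`, so it would need a list-valued parser.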

Visual Evidence Gallery: From Toy Shapes To Compiler-Smoke Infographics

These are actual SVG artifacts copied from the local training cache and compiler-smoke outputs. The early images are intentionally simple. They show why the project moved away from raw SVG and toward DSL-backed generation. The later images show the richer compiler path after the model-facing language became more disciplined.

[Image: Spec02 source SVG asset used in the SVG training corpus]
Spec02/source: a normalized raw SVG asset from the bootstrap corpus.

[Image: Spec04 renderable but incorrect toy geometry output]
Spec04: renderable but wrong. The prompt wanted circle/rect structure; the model produced simple red geometry.

[Image: Spec05 structured scene circle card SVG output]
Spec05: bounded structured scene. Simple, exact, and easier to measure.

[Image: Spec06 structured infographic compare panel SVG output]
Spec06: structured infographic output. The target is no longer only toy geometry.

[Image: Spec10 asset-backed poster stack SVG output]
Spec10: asset scene DSL. The model emits structure; assets and compiler do more work.

[Image: Spec12 table matrix infographic SVG output]
Spec12: gold scene DSL table matrix. This is the bounded DSL direction working cleanly.

[Image: Gen1 full scene DSL poster stack SVG output]
Gen1 full r2: broader full scene DSL solved the visible probe before harder recombination pressure.

[Image: Spec19 memory map compiler-smoke SVG output]
Spec19: memory-map routebook bundle compiled to SVG.

[Image: Spec19 system diagram compiler-smoke SVG output]
Spec19: system-diagram routebook bundle compiled to SVG.

[Image: Spec19 timeline compiler-smoke SVG output]
Spec19: timeline routebook bundle compiled to SVG.

Evidence Boundary Table

This distinction matters because I almost made the report misleading. The polished C Kernel Engine docs infographics are useful as source/corpus/runtime context, but they are not proof that this small SVG model generated polished diagrams from scratch.

Side note: Do not overclaim the images. Source assets, compiler outputs, and model-generated DSL are different evidence types. Mixing them would make the report weaker, not stronger.

Artifact type | What it proves | What it does not prove
Source/corpus SVG | The training corpus had visual examples or documentation assets to learn from. | It does not prove the trained model generated that image.
Model-generated DSL text | The model emitted a structure, bundle, route, or binding sequence. | It does not by itself prove the final SVG was correct.
Compiler-materialized SVG | The generated DSL could be parsed and lowered into a renderable artifact. | It may still be semantically wrong if the route or binding is wrong.
Probe exact/materialized exact | The generated contract matched the expected target under the probe. | It is still bounded to that probe distribution and does not prove broad visual intelligence.
Docs/concept infographic | The repository has polished explanatory assets and target style references. | It should not be described as model output unless the probe report shows it came from that run.

Correction: What Counts As Model-Generated Evidence

I need to separate the evidence clearly. The polished C Kernel Engine documentation diagrams and concept infographics are source/corpus/runtime artifacts. They are useful context for what the SVG training task is trying to learn, but I am not claiming that this small SVG model generated those polished diagrams.

The model-generated evidence in this report is narrower: probe/materialized SVG outputs from trained runs, plus compiler-smoke SVGs created from model-facing bundle or DSL outputs. Source/corpus artifacts teach the task. Probe and compiler-smoke outputs show what the trained system actually emitted or materialized.

The important thing is not that these are final design assets. The important thing is that they connect a prompt, a DSL or bundle, a compiler, a rendered SVG, and a probe report. The progression is the point: raw SVG corpus, toy geometry, bounded DSL, asset-backed scene DSL, and then bundle-to-family compiler-smoke outputs.

How I Would Present The Generated SVGs Honestly

I would not present these images as “look, the model is now a designer.” That is the wrong claim. The right claim is narrower and stronger:

A small locally trained transformer can be taught to emit a bounded visual planning language. A deterministic compiler can then materialize that language into SVG. Probe reports can tell us whether the generated planning language matched the expected contract. That is the valuable part.
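To make "deterministic compiler" concrete, here is a toy sketch of the lowering step. It is not the C Kernel Engine compiler: the function `compile_bundle`, the `TONE_FILL` palette, and the one-rectangle-per-segment layout are all assumptions made up for illustration. The only property it shares with the real pipeline is that the same bundle text always produces the same SVG text.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

FIELD_RE = re.compile(r"\[([a-z_]+):([^\]]*)\]")

# Hypothetical palette; the colors used by the real compiler are not known here.
TONE_FILL = {"green": "#2e9e5b", "blue": "#2f6fd0", "amber": "#d9902f"}


def compile_bundle(bundle, width=320, row_height=24, gap=6):
    """Lower a flat [key:value] bundle into a stack of rectangles, one per segment."""
    fields = dict(FIELD_RE.findall(bundle))
    segments = int(fields.get("segments", "1"))
    fill = TONE_FILL.get(fields.get("tone"), "#888888")
    height = segments * (row_height + gap) + gap
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        f"<title>{escape(fields.get('family', 'unknown'))}</title>",
    ]
    for i in range(segments):
        y = gap + i * (row_height + gap)
        parts.append(
            f'<rect x="{gap}" y="{y}" width="{width - 2 * gap}" '
            f'height="{row_height}" fill="{fill}"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


bundle = (
    "[bundle] [family:memory_map] [form:layer_stack] [theme:signal_glow] "
    "[tone:green] [density:balanced] [background:mesh] [segments:6] "
    "[brackets:1] [cards:3] [/bundle]"
)
svg = compile_bundle(bundle)

# Renderable check: the output parses as XML and has one rect per segment.
root = ET.fromstring(svg)
print(len(root.findall("{http://www.w3.org/2000/svg}rect")))  # 6
```

The point of the split is visible even in the toy: the model only has to get `segments:6` and `tone:green` right, and everything about coordinates and markup is the compiler's job. It also shows why renderable is a weaker claim than exact: a bundle with the wrong tone still compiles cleanly.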

This matters because it points to a general pattern. If the model can emit a controlled SVG bundle, then later it can emit other controlled artifacts: Antsand page descriptions, KiCad block intents, Zephyr device descriptions, simulation configs, or data-preparation recipes. The SVG work is the visual proof of the compiler-facing model pattern.

The Subscription / Research Bundle Angle

One thing I want to preserve is the record of how the system changed. The blog can explain the story, but the artifacts prove the story.

That opens a useful future path. A public post can describe the method, while a packaged bundle can contain the deeper evidence trail: probe reports, run ledgers, autopsies, curriculum manifests, selected outputs, and scripts. That bundle could eventually sit behind a subscription, course, or research archive, not because it is secret magic, but because it is a structured record of the work.

The value is not only the final answer. The value is the process: what failed, what improved, what was measured, what was abandoned, and what should be tested next.

Reproducibility And Artifact Paths

The report is public-facing, but the useful research record is the artifact tree. The paths below are the anchors I used for this write-up.

Artifact | Local path pattern | Use
Run directories | ~/.cache/ck-engine-v7/models/train/<run_name>/ | One folder per training run.
Probe reports | *probe_report.json and *probe_report.html | Exact/renderable/materialized scores plus prompt/output cases.
Probe contracts | *probe_contract.json | Expected prompts, outputs, rendered SVGs, and content JSON.
Training ledger | run_ledger.jsonl | Stages, datasets, tokens, steps, losses, and checkpoints.
Tokenizer manifest | tokenizer_manifest.json | Tokenizer line, workspace, reserved/control-token information, and roundtrip evidence where present.
Dataset profile | dataset_profile.json | Line counts, character counts, duplicates, and dataset surface checks.
Compiler smoke | *compiler_smoke_report.json, *compiler_smoke/*.svg | Proof that model-facing DSL/bundles can lower into SVG files.
Run hub | ~/.cache/ck-engine-v7/models/ir_hub.html | Dashboard over the broader run surface after spec19.
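The path patterns in the table are enough to build a quick inventory of what each run left behind. The sketch below assumes only what the table states: one folder per run, a `run_ledger.jsonl` with one JSON object per line, and the `*probe_report.json` and `*compiler_smoke/*.svg` naming patterns. Where those files sit inside a run folder is an assumption (the sketch searches recursively), and the function names `read_ledger` and `inventory` are hypothetical.

```python
import json
from pathlib import Path


def read_ledger(run_dir):
    """Return the parsed rows of run_ledger.jsonl, or [] if the run has none."""
    ledger = Path(run_dir) / "run_ledger.jsonl"
    if not ledger.is_file():
        return []
    with ledger.open(encoding="utf-8") as handle:
        # JSONL: one JSON object per non-blank line.
        return [json.loads(line) for line in handle if line.strip()]


def inventory(train_root):
    """Map each run directory name to a summary of the artifacts found inside it."""
    root = Path(train_root).expanduser()
    result = {}
    if not root.is_dir():
        return result
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        smoke_svgs = [
            svg
            for smoke_dir in run_dir.rglob("*compiler_smoke")
            if smoke_dir.is_dir()
            for svg in smoke_dir.glob("*.svg")
        ]
        result[run_dir.name] = {
            "ledger_rows": len(read_ledger(run_dir)),
            "probe_reports": sorted(p.name for p in run_dir.rglob("*probe_report.json")),
            "compiler_smoke_svgs": len(smoke_svgs),
        }
    return result


if __name__ == "__main__":
    for name, found in inventory("~/.cache/ck-engine-v7/models/train").items():
        print(name, found)
```

A run with zero ledger rows or no probe report is immediately visible in this listing, which is a cheap first check before trusting any score attached to that run.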

What I Would Do Next

The next version of this work should harden the training ledger even more:

  1. Make every spec declare its dataset, tokenizer, eval contract, and promotion gate.
  2. Keep per-family metrics instead of only aggregate exact/renderable rates.
  3. Track whether failures are syntax, routing, materialization, context, tokenizer, or capacity failures.
  4. Separate memorization tests from generalization tests.
  5. Package every promoted run into a reproducible artifact bundle.
  6. Only scale model depth or embedding size after the representation and evals are stable.

That last point matters. Scaling is not the first move. Scaling is the move after the smaller experiment proves that the contract is worth scaling.
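Items 2 and 3 in the list above (per-family metrics and a failure taxonomy) can be sketched as a small aggregation over probe cases. The case schema here (`family`, `exact`, `failure`) is an assumption for illustration, not the actual probe report format, and the sample rows are made-up placeholders rather than results from any run.

```python
from collections import Counter, defaultdict

# Failure kinds taken from item 3 of the list above.
FAILURE_KINDS = {"syntax", "routing", "materialization", "context", "tokenizer", "capacity"}


def per_family_report(cases):
    """Aggregate exact rate and failure-kind counts per family."""
    totals = Counter()
    exact = Counter()
    failures = defaultdict(Counter)
    for case in cases:
        family = case["family"]
        totals[family] += 1
        if case["exact"]:
            exact[family] += 1
        else:
            kind = case.get("failure", "unclassified")
            # Anything outside the agreed taxonomy is kept visible, not silently dropped.
            failures[family][kind if kind in FAILURE_KINDS else "unclassified"] += 1
    return {
        family: {
            "n": totals[family],
            "exact_rate": exact[family] / totals[family],
            "failures": dict(failures[family]),
        }
        for family in sorted(totals)
    }


# Placeholder rows for illustration only; these are not measured results.
sample_cases = [
    {"family": "memory_map", "exact": True},
    {"family": "memory_map", "exact": False, "failure": "routing"},
    {"family": "timeline", "exact": True},
    {"family": "timeline", "exact": True},
]

for family, row in per_family_report(sample_cases).items():
    print(family, row)
```

In the placeholder data the aggregate exact rate is 0.75, which hides that one family sits at 0.5 and the other at 1.0. That gap is what a per-family breakdown is meant to expose.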

What Spec20 Should Actually Test

Since the current bundle does not contain a spec20 artifact, spec20 should not be a vague “make it better” run. It should answer a specific next question.

Spec20 Candidate A: Clean Tail And Stop Hygiene

Freeze the spec19 bundle contract and focus only on eliminating dirty tails, prompt leakage, and special-token leakage. Promotion should require the first closed bundle to be correct and the raw decode to be clean.
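A promotion gate for Candidate A could look something like the sketch below. It assumes the raw decode is the first closed bundle followed by whatever the model kept emitting, and that a single `<|eos|>` directly after the bundle is the clean ending. The function name `tail_report` and the reason labels are hypothetical, not part of the existing probe.

```python
import re

# First closed bundle, non-greedy so a second bundle in the tail is not swallowed.
BUNDLE_RE = re.compile(r"\[bundle\].*?\[/bundle\]", re.DOTALL)
# Special tokens of the form <|name|>, such as <|eos|> or <|bos|>.
SPECIAL_RE = re.compile(r"<\|[a-z_]+\|>")


def tail_report(raw_decode, stop_token="<|eos|>"):
    """Split a raw decode into its first closed bundle and classify what follows it."""
    match = BUNDLE_RE.search(raw_decode)
    if match is None:
        return {"bundle": None, "clean": False, "reasons": ["no_closed_bundle"]}
    tail = raw_decode[match.end():].strip()
    # One stop token right after the bundle is expected and is not counted as leakage.
    if tail.startswith(stop_token):
        tail = tail[len(stop_token):].strip()
    reasons = []
    if SPECIAL_RE.search(tail):
        reasons.append("special_token_leak")
    if "[task:" in tail or "[OUT]" in tail:
        reasons.append("prompt_leak")
    if tail and not reasons:
        reasons.append("other_trailing_text")
    return {"bundle": match.group(0), "clean": not tail, "reasons": reasons}


bundle = (
    "[bundle] [family:memory_map] [form:typed_regions] [theme:infra_dark] "
    "[tone:blue] [density:balanced] [background:none] [segments:5] "
    "[brackets:0] [cards:3] [/bundle]"
)
# Tail text copied from the spec19 "dirty tail" case in this report.
dirty = bundle + "<|eos|><|bos|>pexatlan exctly for mpathatond ft. [family:memory_map]"

print(tail_report(dirty)["reasons"])            # ['special_token_leak']
print(tail_report(bundle + "<|eos|>")["clean"])  # True
```

Keeping this check separate from the exact-match check matters: a decode can pass one and fail the other, and the two failures call for different repairs.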

Spec20 Candidate B: Family/Form Minimal Pair Repair

Freeze tokenizer and compiler, then build a targeted minimal-pair dataset around the exact family/form misses found in the spec19 autopsy. Promotion should require hidden split improvement, not just train recovery.
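One way to build such a dataset is to hold every field fixed and vary exactly one, so the model cannot solve the pair without attending to that field. This is a minimal sketch under that assumption. The prompt and target layouts are copied from the spec19 case in this report, the form values are the ones that appear in the probe cases, and the helper names `render_pair` and `minimal_pairs` are hypothetical.

```python
# Field order as it appears in the spec19 prompts and bundles in this report.
FIELD_ORDER = ["form", "theme", "tone", "density", "background", "segments", "brackets", "cards"]


def render_pair(family, fields):
    """Render one (prompt, target) training row from a field dictionary."""
    body = " ".join(f"[{key}:{fields[key]}]" for key in FIELD_ORDER)
    prompt = f"[task:svg] [layout:{family}] {body} [OUT]"
    target = f"[bundle] [family:{family}] {body} [/bundle]"
    return prompt, target


def minimal_pairs(family, base, field, alternatives):
    """Return (base_row, variant_row) pairs that differ in exactly one field."""
    base_row = render_pair(family, base)
    pairs = []
    for value in alternatives:
        if value == base[field]:
            continue  # identical to the base, so it would not be a contrast
        variant = dict(base)
        variant[field] = value
        pairs.append((base_row, render_pair(family, variant)))
    return pairs


base = {
    "form": "layer_stack", "theme": "signal_glow", "tone": "green",
    "density": "balanced", "background": "mesh",
    "segments": "6", "brackets": "1", "cards": "3",
}

pairs = minimal_pairs("memory_map", base, "form",
                      ["layer_stack", "typed_regions", "arena_sections"])
for (_, base_target), (_, variant_target) in pairs:
    print(base_target)
    print(variant_target)
    print()
```

For the hidden-split requirement, the field combinations used for evaluation would need to be excluded when generating the training pairs; otherwise improvement on the pairs only measures train recovery.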

Spec20 Candidate C: Tokenizer Boundary Ablation

Keep the same spec19 prompts and targets, then compare reserved structural tokens against a more compositional token surface. This should measure whether the tokenizer is making family/form learning easier or harder.

My Preferred Next Gate

I would do Candidate A first. Dirty tail behavior is not a design problem; it is a hygiene problem. If the model can emit a valid bundle but the raw decode leaks special tokens or prompt fragments, that should be fixed before widening the task again.

What I Still Do Not Know

Side note: The uncertainty is part of the artifact. A useful research log should say what it proved, what it did not prove, and what the next gate must test.

This report should not pretend to be more than it is. The work taught me a lot, but several claims are still outside the evidence.

  • I do not know how close this is to how frontier labs actually structure pretraining or post-training at scale.
  • I have not proved broad visual intelligence. I proved bounded SVG/DSL contract learning under local probes.
  • I have not proved the model can design polished infographics from scratch. The honest claim is compiler-facing planning plus deterministic materialization.
  • I have not solved tokenizer design permanently. The gen1 atomic ctx2048 run shows that the tokenizer/materialization boundary can break again at larger surfaces.
  • I have not proved the generated artifacts are production-quality design assets. They are research artifacts and evidence of a workflow.
  • I have not finished the subscription/research bundle packaging. The artifacts exist, but the packaging, checksums, and reproducible run scripts still need hardening.

Conclusion

The honest conclusion is that training even a small useful model is hard. Not mysterious in a magical way, but hard in a systems way. Every layer matters: data, tokenizer, architecture, loss, evals, runtime, reports, and the discipline to not fool yourself.

That is why I like this work. It forces the training process to become concrete. Instead of saying “the model learned” or “the model failed,” I can ask a better question: which contract did it satisfy, which contract did it break, and what artifact proves that?

That is the direction I want C Kernel Engine training to go. Not hype. Not vague demos. A reproducible, inspectable, CPU-native training system where every run leaves behind enough evidence to decide what should happen next.