This ShivasNotes lab note studies how a real vision-language model turns images into tokens that a language decoder can use. It connects earlier notes on activations, normalization, attention, residual connections, full backprop, optimizers, and DeltaNet. This is where those abstractions stop being classroom diagrams and become one real production vision-language stack.
Speaker note. The headline of this post is not just that Qwen3-VL can see images. It is that the C-Kernel-Engine v8 source declares that vision path with an explicit header → body → footer template. That pattern is not a teaching metaphor. It is literally the code-generation contract.
Roadmap for this post
First, we explain the vision header: patchify, dual projection, biasing, and 2D position handling.
Then we walk the 27-layer ViT body, the M-RoPE attention geometry, the DeepStack branches, and the footer projector.
Finally, we connect the encoder to the multimodal bridge, the text decoder, a SigLIP comparison, the memory profile, and the larger lesson of the series.
Introduction — From Attention to Vision
In the earlier attention note, we derived attention as a routing rule over tokens. In the residual-connections note, we saw why deep stacks stay trainable. In the full-backprop note, gradients were flowing end to end. In the optimizer note, AdamW was nudging those parameters into something useful. And in the DeltaNet note, we looked at recurrent alternatives when full attention is not always the cheapest memory contract.
The natural next question is: what does all of that look like when the input is not text, but an image? How does a real vision-language model turn pixels into tokens that a language decoder can reason over? Qwen3-VL is one of the strongest open multimodal systems in that class, and C-Kernel-Engine v8 gives us an unusually clear answer because the architecture is exposed as templates plus generated C.
Quick vision-encoder mental model
A vision encoder usually starts by cutting an image into patches, projecting each patch into an embedding vector, adding spatial position information, and then running those patch embeddings through transformer blocks. The result is no longer “pixels.” It is a sequence of visual tokens.
A vision-language model then has one more job: convert those visual tokens into the same kind of representation the language decoder expects. That bridge is what lets a decoder answer questions about an image using the same transformer machinery it uses for text.
The crucial idea is simple. The image encoder is not a mysterious monolith. It is a pipeline with a header, a repeated body, and a footer. Header means “convert image grid into transformer-ready patch tokens.” Body means “run the repeated ViT layers.” Footer means “compress and project those vision features into the embedding space the language model expects.”

This is why this post matters in the arc of the series. It is where attention, normalization, activations, matrix multiplies, residuals, and projection layers all appear in one auditable production path.
27 + 36: Qwen3-VL in the provided CKE v8 material uses a 27-layer vision encoder feeding a 36-layer text decoder.
The core narrative
Header/body/footer is not a storytelling convenience. In version/v8/templates/qwen3_vl_vision.json, it is the actual block sequence that the code generator lowers into kernels, buffers, and unrolled encoder code.
The Template Architecture — Header, Body, Footer
C-Kernel-Engine v8 organizes model definition in three conceptual stages described in version/v8/templates/PIPELINE.md: Template → GraphIR → LoweredIR. The template stage is architecture-agnostic. It names the ops and their order. Later stages bind those ops to actual kernels and generated C code.
For Qwen3-VL vision, the template file is version/v8/templates/qwen3_vl_vision.json. It explicitly declares one block type called vision_encoder. That block type contains a sequence field set to ["header", "body", "footer"]. There is no metaphor here. This is the source of truth that tells the runtime what must happen first, what repeats, and what gets emitted at the end.

{
"version": 3,
"name": "qwen3_vl_vision",
"family": "vision_transformer_with_branches",
"flags": {
"patch_frontend": "dual_patch_proj_sum",
"activation": "gelu",
"normalization": "layernorm"
},
"block_types": {
"vision_encoder": {
"sequence": ["header", "body", "footer"],
"header": [
{"id": "patchify", "op": "patchify"},
{"id": "patch_proj", "op": "patch_proj"},
{"id": "patch_proj_aux", "op": "patch_proj_aux"},
{"id": "patch_sum", "op": "add_stream", "params": {"merge_size_from_config": "spatial_merge_size"}},
{"id": "patch_bias", "op": "patch_bias_add"},
{"id": "position_embeddings", "op": "position_embeddings", "params": {"merge_size_from_config": "spatial_merge_size"}},
{"id": "vision_position_ids", "op": "position_ids_2d", "params": {"position_rank": 4, "merge_size_from_config": "spatial_merge_size"}}
],
"body": {
"type": "dense",
"ops": [
{"id": "attn_norm", "op": "layernorm"},
{"id": "qkv_packed_proj", "op": "qkv_packed_proj"},
{"id": "split_qkv", "op": "split_qkv_packed"},
{"id": "vision_mrope", "op": "rope_qk", "params": {"rope_mode": "vision", "position_rank": 4, "section_policy": "equal_4way"}},
{"id": "attn", "op": "attn"},
{"id": "attn_out_proj", "op": "out_proj"},
{"id": "attn_residual", "op": "residual_add"},
{"id": "ffn_norm", "op": "layernorm"},
{"id": "mlp_up", "op": "mlp_up"},
{"id": "mlp_gelu", "op": "gelu"},
{"id": "mlp_down", "op": "mlp_down"},
{"id": "mlp_residual", "op": "residual_add"}
]
},
"footer": [
{"id": "final_norm", "op": "layernorm"},
{"id": "merge_main", "op": "spatial_merge"},
{"id": "projector_fc1", "op": "projector_fc1"},
{"id": "projector_gelu", "op": "projector_gelu"},
{"id": "projector_fc2", "op": "projector_fc2"},
{"id": "deepstack_concat", "op": "branch_concat", "output": "vision_embeddings"}
]
}
}
}
Header means image preprocessing into token tensors. Body means repeated transformer blocks. Footer means cross-modal projection into the language-model interface. That exact decomposition is present in the JSON before any C is generated.
| Stage | What it does | Why it exists |
|---|---|---|
| Header | Image → patches → projected embeddings + spatial positions | Transforms raw pixels into transformer tokens. |
| Body | 27 repeated ViT layers with attention and MLP | Learns contextual visual features across the whole patch grid. |
| Footer | Normalize, merge, project, concatenate branch outputs | Produces vision_embeddings in the shape the decoder can consume. |
If you have been following the series, this should feel familiar. We started with local ideas like matrix multiplication and layer normalization. The template shows how those operations become a software architecture.
The deepest lesson of CKE v8 is that production model architecture is legible when the template language is honest about stages, ops, and data movement.
Header Step 1 — Patchify (im2patch)
Vision transformers do not ingest a [3, 448, 448] image the way a CNN does. They first cut it into fixed-size patches. In the Qwen3-VL setup described in the source material, patch size is P = 14. So a 448 × 448 image becomes a 32 × 32 patch grid because 448 / 14 = 32. That yields 32 × 32 = 1024 patches.
Each patch contains 3 × 14 × 14 = 588 scalar values. So after patchification the image is no longer stored conceptually as height-width pixels. It is stored as a token matrix of shape [1024, 588]. This is the exact moment vision becomes “tokenizable” and therefore transformer-compatible.

#include <stddef.h>  /* size_t */

void im2patch(const float *image, float *patches, int C, int H, int W, int P)
{
const int grid_h = H / P;
const int grid_w = W / P;
const int patch_dim = C * P * P;
for (int ph = 0; ph < grid_h; ++ph) {
for (int pw = 0; pw < grid_w; ++pw) {
const int patch_idx = ph * grid_w + pw;
float *out = patches + (size_t)patch_idx * patch_dim;
int k = 0;
for (int c = 0; c < C; ++c) {
for (int py = 0; py < P; ++py) {
for (int px = 0; px < P; ++px) {
const int iy = ph * P + py;
const int ix = pw * P + px;
out[k++] = image[(size_t)c * H * W + (size_t)iy * W + ix];
}
}
}
}
}
}
Notice how explicit the memory walk is. The kernel loops over patch rows and columns, then loops inside each patch over channels and local offsets. The output is one contiguous patch vector at a time.
That means the patch index is already being flattened into token order before any transformer logic begins.
1024 patches: For 448 × 448 inputs with P = 14, the vision encoder starts with a 32 × 32 patch grid.
Why patchify matters after the matrix-shapes note
In the matrix-post part of the series, we treated linear algebra shapes as the real story. Patchify is the shape bridge for vision. It changes an image tensor into the token matrix that the later GEMMs and attention blocks know how to process.
| Object | Shape | Count |
|---|---|---|
| Input image | [3, 448, 448] | 602,112 floats |
| Patch grid | [32, 32] | 1,024 locations |
| Flattened patches | [1024, 588] | 602,112 floats |
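The shape arithmetic in the table can be checked mechanically. The following is a minimal sketch, not part of C-Kernel-Engine: `PatchShape` and `patch_shape` are illustrative names introduced here, and the divisibility check is an added assumption about how a caller might want to guard the input.

```c
/* Shape bookkeeping for patchify. PatchShape and patch_shape are
 * illustrative names, not part of C-Kernel-Engine. */
typedef struct {
    int grid_h;       /* patches along the image height */
    int grid_w;       /* patches along the image width  */
    int num_patches;  /* grid_h * grid_w                */
    int patch_dim;    /* C * P * P scalars per patch    */
} PatchShape;

/* Fills *out and returns 0, or returns -1 when H or W is not a
 * multiple of P (assumed policy: reject instead of truncating). */
int patch_shape(int C, int H, int W, int P, PatchShape *out)
{
    if (P <= 0 || H % P != 0 || W % P != 0) {
        return -1;
    }
    out->grid_h = H / P;
    out->grid_w = W / P;
    out->num_patches = out->grid_h * out->grid_w;
    out->patch_dim = C * P * P;
    return 0;
}
```

Calling `patch_shape(3, 448, 448, 14, &s)` should fill in a 32 × 32 grid, 1024 patches, and a patch width of 588. The explicit rejection matters because the integer divisions `H / P` and `W / P` in im2patch would otherwise drop any trailing rows or columns without complaint.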
Header Step 2 — Dual Patch Projection
Once the image has become [1024, 588] patches, Qwen3-VL does something richer than a standard single projection. The template declares two parallel ops: patch_proj and patch_proj_aux. Both read the same patch matrix. Both project it into the encoder embedding width. In the generated configuration, that width is EMBED_DIM = 1152.
So the header performs two GEMMs of the form [1024, 588] × [588, 1152] → [1024, 1152]. The two outputs are not kept separate forever. They are merged by patch_sum, whose runtime helper is described as add_stream_reorder_2d in src/kernels/vision_kernels.c lines 384-412. That merge is not a naïve row-wise add. It includes merged-tile spatial reordering.
The dual-projection frontend gives the model two learned views of the same raw patch content before the main transformer stack begins. One reasonable engineering interpretation is that the auxiliary branch enriches local visual features without changing the downstream transformer contract.
// Patches from im2patch
float patches[1024][588];
// Main learned projection
float patch_proj[1024][1152] = GEMM(patches, W_main);
// Auxiliary learned projection
float patch_proj_aux[1024][1152] = GEMM(patches, W_aux);
// Spatially reordered merged-tile addition
float patch_sum[1024][1152] = add_stream_reorder_2d(
patch_proj,
patch_proj_aux,
merge_size);
// Bias then positional information
float patch_bias[1024][1152] = patch_sum + b_patch;
That ordering matters. The model does not add position first and project later. It first learns a token embedding from pixel content, then merges the two streams, then applies a bias, then injects positional structure. In other words, content channels are formed before spatial context is layered on top.
There is a subtle systems point here. Because add_stream_reorder_2d handles merged-tile ordering, the encoder can keep the patch tokens arranged the way later merge operations expect. The header is already thinking ahead to the footer.
| Header op | Input shape | Output shape |
|---|---|---|
| patch_proj | [1024, 588] | [1024, 1152] |
| patch_proj_aux | [1024, 588] | [1024, 1152] |
| patch_sum | two [1024, 1152] streams | [1024, 1152] |
| patch_bias | [1024, 1152] | [1024, 1152] |
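To make the shape flow in this table concrete, here is a deliberately simplified sketch of the dual projection. It is not the CKE kernel: `gemm_rowmajor` and `dual_patch_embed` are hypothetical names, the GEMM is a naive triple loop, and the merged-tile reordering performed by add_stream_reorder_2d is omitted, so the two streams are summed row by row in arrival order.

```c
#include <stddef.h>

/* Naive row-major GEMM: out[T, N] = in[T, K] * w[K, N]. */
static void gemm_rowmajor(const float *in, const float *w, float *out,
                          int T, int K, int N)
{
    for (int t = 0; t < T; ++t) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += in[(size_t)t * K + k] * w[(size_t)k * N + n];
            }
            out[(size_t)t * N + n] = acc;
        }
    }
}

/* Simplified dual patch frontend: main projection + auxiliary projection,
 * summed, then biased. Assumption: the merged-tile reorder is left out,
 * so this is a plain elementwise sum. scratch must hold T * N floats. */
void dual_patch_embed(const float *patches, const float *w_main,
                      const float *w_aux, const float *bias,
                      float *out, float *scratch, int T, int K, int N)
{
    gemm_rowmajor(patches, w_main, out, T, K, N);
    gemm_rowmajor(patches, w_aux, scratch, T, K, N);
    for (int t = 0; t < T; ++t) {
        for (int n = 0; n < N; ++n) {
            const size_t i = (size_t)t * N + n;
            out[i] = out[i] + scratch[i] + bias[n];
        }
    }
}
```

With the dimensions from this section, the call would use T = 1024, K = 588, and N = 1152. A production path would replace the naive GEMM with a tuned kernel and fold the reorder into the sum.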
Header Step 3 — Position Embeddings (2D M-RoPE)
Text is naturally one-dimensional. Token 7 comes after token 6. Images are different. A patch has both a vertical coordinate and a horizontal coordinate. So the vision encoder needs a 2D position system, not just a single running index.
Qwen3-VL handles that in two related ways. First, position_embeddings_add_tiled_2d in src/kernels/vision_kernels.c lines 158-239 adds learned position embeddings with bilinear interpolation, so the runtime can adapt when the actual patch grid differs from the training grid. Second, vision_position_ids_2d_merge in lines 250-287 emits four position streams for M-RoPE.

#include <stdint.h>  /* int32_t */

void vision_position_ids_2d_merge(int32_t *positions, int grid_h, int grid_w, int merge_size)
{
// Output: [4, T] where T = grid_h * grid_w
// Four streams: [y, x, y, x] for M-RoPE sections
const int T = grid_h * grid_w;
int idx = 0;
// Merged-tile traversal order
for (int bh = 0; bh < grid_h; bh += merge_size) {
for (int bw = 0; bw < grid_w; bw += merge_size) {
for (int dh = 0; dh < merge_size && (bh + dh) < grid_h; ++dh) {
for (int dw = 0; dw < merge_size && (bw + dw) < grid_w; ++dw) {
int y = bh + dh;
int x = bw + dw;
positions[0 * T + idx] = y; // stream 0: y
positions[1 * T + idx] = x; // stream 1: x
positions[2 * T + idx] = y; // stream 2: y (duplicate)
positions[3 * T + idx] = x; // stream 3: x (duplicate)
idx++;
}
}
}
}
}
The line that matters most is the output convention: [y, x, y, x]. That is the multi-section rotary layout for vision. Instead of one scalar position per token, each patch gets four coordinated streams that will later rotate different quarters of the query and key vectors.
Also notice the traversal order. It is not simple raster scan. The function walks merged tiles first, then offsets within each tile. That means the positional streams are aligned to the same merged spatial logic used elsewhere in the encoder.
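One way to see the difference from raster scan is to record, for each emitted token, which raster index it came from. The sketch below reuses the same loop nest as the function above; `merged_tile_order` is a hypothetical helper introduced only for illustration.

```c
/* Builds order[i] = raster index (y * grid_w + x) of the i-th token when
 * the grid is walked tile by tile, mirroring the loop nest of
 * vision_position_ids_2d_merge. order must hold grid_h * grid_w ints.
 * Returns the number of entries written. */
int merged_tile_order(int *order, int grid_h, int grid_w, int merge_size)
{
    int idx = 0;
    if (merge_size <= 0) {
        return 0;
    }
    for (int bh = 0; bh < grid_h; bh += merge_size) {
        for (int bw = 0; bw < grid_w; bw += merge_size) {
            for (int dh = 0; dh < merge_size && bh + dh < grid_h; ++dh) {
                for (int dw = 0; dw < merge_size && bw + dw < grid_w; ++dw) {
                    order[idx++] = (bh + dh) * grid_w + (bw + dw);
                }
            }
        }
    }
    return idx;
}
```

For a 4 × 4 grid with merge_size = 2, the first eight entries are 0, 1, 4, 5, 2, 3, 6, 7: the whole top-left 2 × 2 tile is visited before the walk moves right to the next tile.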
Vision positions are generated in the order the model wants to reason about merged tiles, not just in the order a human would read image rows.
Flexible resolution is an engineering choice
Bilinear interpolation in position_embeddings_add_tiled_2d means the model is not hard-coded to one exact training grid at runtime. The source grid of learned embeddings can be smoothly resampled when the real patch grid changes.
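As a rough illustration of that resampling, here is a minimal bilinear sketch. It is not the position_embeddings_add_tiled_2d kernel: `resample_pos_embed` is a hypothetical name, the table layout [src_h, src_w, D] is assumed, the tiled addition into the token stream is left out, and the corner-aligned sampling convention is an assumption that the real kernel may not share.

```c
#include <stddef.h>

/* Bilinearly resample a learned position table src[src_h, src_w, D] onto
 * dst[dst_h, dst_w, D]. Assumption: corner-aligned sampling, so the first
 * and last destination rows/columns land exactly on the first and last
 * source rows/columns. */
void resample_pos_embed(const float *src, int src_h, int src_w,
                        float *dst, int dst_h, int dst_w, int D)
{
    for (int y = 0; y < dst_h; ++y) {
        const float fy = (dst_h > 1) ? (float)y * (src_h - 1) / (dst_h - 1) : 0.0f;
        const int y0 = (int)fy;
        const int y1 = (y0 + 1 < src_h) ? y0 + 1 : y0;
        const float wy = fy - (float)y0;
        for (int x = 0; x < dst_w; ++x) {
            const float fx = (dst_w > 1) ? (float)x * (src_w - 1) / (dst_w - 1) : 0.0f;
            const int x0 = (int)fx;
            const int x1 = (x0 + 1 < src_w) ? x0 + 1 : x0;
            const float wx = fx - (float)x0;
            /* Four neighbouring source embeddings. */
            const float *p00 = src + ((size_t)y0 * src_w + x0) * D;
            const float *p01 = src + ((size_t)y0 * src_w + x1) * D;
            const float *p10 = src + ((size_t)y1 * src_w + x0) * D;
            const float *p11 = src + ((size_t)y1 * src_w + x1) * D;
            float *out = dst + ((size_t)y * dst_w + x) * D;
            for (int d = 0; d < D; ++d) {
                const float top = p00[d] + wx * (p01[d] - p00[d]);
                const float bot = p10[d] + wx * (p11[d] - p10[d]);
                out[d] = top + wy * (bot - top);
            }
        }
    }
}
```

When the destination grid matches the source grid, this reduces to a plain copy, which is the behavior one would expect when the runtime grid equals the training grid.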
| Position mechanism | Purpose | Source reference |
|---|---|---|
| Learned 2D positional embeddings | Add coarse spatial prior to patch tokens | position_embeddings_add_tiled_2d, lines 158-239 |
| 2D M-RoPE position IDs | Drive rotary phase rotations for Q and K | vision_position_ids_2d_merge, lines 250-287 |
| Merged-tile ordering | Keep positions aligned with spatial merge semantics | same function, inner tile loops |
Body — The Vision Transformer (27 Layers)
Once the header is done, the encoder is holding a patch-token matrix of width 1152 with content and spatial information already attached. The body of the model then repeats a dense transformer block 27 times. In the generated file build/regression-runs/qwen3vl/multimodal_bridge/encoder/encoder_v8.c, those 27 layers are unrolled. The configuration in the source material is EMBED_DIM 1152, NUM_HEADS 16, HEAD_DIM 72, and ROTARY_DIM 72.
This is a normal transformer rhythm, but applied to image tokens rather than text tokens: normalize, build QKV, rotate Q and K, attend across tokens, project back, add residual, normalize again, run the MLP, and add the second residual. If you remember the attention and MLP posts from the series, you already know the conceptual pieces. The novelty here is how they are staged in a real vision encoder.
{
"type": "dense",
"ops": [
{"id": "attn_norm", "op": "layernorm"},
{"id": "qkv_packed_proj", "op": "qkv_packed_proj"},
{"id": "split_qkv", "op": "split_qkv_packed"},
{"id": "vision_mrope", "op": "rope_qk", "params": {"rope_mode": "vision", "position_rank": 4, "section_policy": "equal_4way"}},
{"id": "attn", "op": "attn"},
{"id": "attn_out_proj", "op": "out_proj"},
{"id": "attn_residual", "op": "residual_add"},
{"id": "ffn_norm", "op": "layernorm"},
{"id": "mlp_up", "op": "mlp_up"},
{"id": "mlp_gelu", "op": "gelu"},
{"id": "mlp_down", "op": "mlp_down"},
{"id": "mlp_residual", "op": "residual_add"}
]
}
Attention: The Core Of The Transformer derived the attention scores. Activation Functions explained why nonlinear activations such as GELU matter. LayerNorm And RMSNorm covered normalization. Residual Connections explained residual additions. Here, the Qwen3-VL body stitches all four into one repeated production block.
| Step in one layer | Role | Output width |
|---|---|---|
| attn_norm | Stabilize input statistics before attention | 1152 |
| qkv_packed_proj + split_qkv | Form per-head queries, keys, and values | 16 × 72 per stream |
| vision_mrope | Inject 2D phase structure into Q and K | same shape |
| attn + attn_out_proj | Mix information across all patches | 1152 |
| ffn_norm + MLP | Per-token nonlinear feature transformation | 1152 |
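The "16 × 72 per stream" row is mostly an indexing exercise, which a short sketch can make visible. The code below is an illustration under stated assumptions, not the split_qkv_packed kernel: `split_qkv_head_major` is a hypothetical name, each packed row is assumed to be laid out as [Q | K | V] with heads contiguous inside each third, and the head-major [H, T, head_dim] output layout is a guess suggested by the attention kernel's name.

```c
#include <stddef.h>

/* Split packed[T, 3 * H * hd] into head-major Q, K, V buffers of shape
 * [H, T, hd]. Assumption: every packed row is [Q | K | V], and inside each
 * third the H heads are stored back to back. */
void split_qkv_head_major(const float *packed, float *q, float *k, float *v,
                          int T, int H, int hd)
{
    const int D = H * hd;  /* 16 * 72 = 1152 in this encoder */
    for (int t = 0; t < T; ++t) {
        const float *row = packed + (size_t)t * 3 * D;
        for (int h = 0; h < H; ++h) {
            const size_t dst = ((size_t)h * T + t) * hd;
            for (int i = 0; i < hd; ++i) {
                q[dst + i] = row[0 * D + h * hd + i];
                k[dst + i] = row[1 * D + h * hd + i];
                v[dst + i] = row[2 * D + h * hd + i];
            }
        }
    }
}
```

Under these assumptions, each head ends up with its own contiguous [T, 72] slab, so the later per-head rotation and attention loops can walk memory linearly.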
There is one especially important difference from the decoder side. This encoder body is not causal. It is bidirectional. Every patch may attend to every other patch because the model is describing an already fully visible image, not predicting an unknown future token.
16 heads × 72: The provided generated encoder configuration multiplies to the full vision width: 16 × 72 = 1152.
Body Detail — Vision M-RoPE
Standard text RoPE rotates query and key coordinates according to a one-dimensional token index. Vision M-RoPE is more structured. According to the source material, the relevant implementation is mrope_qk_vision() in src/kernels/rope_kernels.c. It consumes the four position streams [y, x, y, x] and divides the head dimension into four equal sections.
Here the head dimension is 72. With equal_4way section policy, the split is 72 / 4 = 18 features per section. Section 0 is rotated with the first y stream. Section 1 uses x. Section 2 uses y again. Section 3 uses x again.

The duplicated y and x streams are not redundant noise. They let the rotary structure spread 2D geometry across multiple frequency bands inside the head. That gives attention a richer way to sense vertical and horizontal displacement.
| Head slice | Width | Position stream |
|---|---|---|
| Section 0 | 18 | y |
| Section 1 | 18 | x |
| Section 2 | 18 | y |
| Section 3 | 18 | x |
Vision RoPE versus text RoPE
In the source material, the vision encoder uses theta = 10000.0 with kernel = mrope_qk_vision. The text decoder later switches to theta = 5000000.0 with kernel = mrope_qk_text and a 1D multi-section layout. Same family of idea, different geometry.
Without 2D-aware rotation, attention can know content similarity but not cleanly encode that one patch is above, below, left, or right of another. M-RoPE turns spatial coordinates into phase structure the dot products can actually feel.
Body Detail — Bidirectional Attention
Text generation uses causal masking because future tokens are supposed to stay hidden. Vision encoding has no such constraint. The whole image is already present. A patch in the top-left corner must be able to compare itself with a patch in the bottom-right corner immediately.
That is why the vision attention in this stack is dense and bidirectional. The source material explicitly frames it as attn_variant: "dense_bidirectional" with causal: false. The referenced execution path uses attention_forward_full_head_major_gqa_ggml_strided.
This is conceptually the same attention math from Attention: The Core Of The Transformer. Queries score keys, softmax produces routing weights, and values are blended. What changes is the mask. For text decode, the mask is triangular. For image encode, the mask is fully open.
Vision attention is global scene reasoning, not next-token prediction. The patch grid behaves like a fully visible set, not a partial prefix.
| Property | Vision encoder | Text decoder |
|---|---|---|
| Masking | Bidirectional / full | Causal / past-only |
| Reason | All image patches are already known | Future text must stay hidden during generation |
| Common operation | QK score → softmax → weighted V | QK score → softmax → weighted V |
| Main geometry | 2D patch positions | 1D token positions |
This is also why a vision encoder does not need a long-lived autoregressive KV-cache in the way a decoder does. The image is processed in one pass. The attention workspace is temporary. Once the encoder has produced the final vision tokens, the decoder no longer needs per-layer image KV state from the encoder.
DeepStack Branches — Multi-Layer Feature Extraction
Qwen3-VL adds something more specialized than a plain ViT footer. The template contains a branch named deepstack. Its tap point is declared as body.mlp_residual.out at a set of deepstack_layer_indices from configuration. So intermediate layer outputs are not discarded. Selected ones are harvested and fed into their own mini-projector path.
This is a strong design choice. Instead of trusting only the final layer to summarize the entire image hierarchy, the encoder preserves some intermediate representations. That lets the eventual multimodal projector see a blend of lower-level and higher-level features.

{
"name": "deepstack",
"kind": "fixed_branch",
"tap": {
"from": "body.mlp_residual.out",
"layers_from_config": "deepstack_layer_indices"
},
"producer": {
"ops": [
{"id": "merge", "op": "branch_spatial_merge"},
{"id": "norm", "op": "branch_layernorm"},
{"id": "fc1", "op": "branch_fc1"},
{"id": "gelu", "op": "branch_gelu"},
{"id": "fc2", "op": "branch_fc2"}
]
},
"collect": {
"mode": "concat",
"axis": "feature",
"target": "branch.deepstack"
}
}
The branch logic mirrors the main footer logic in miniature. It spatially merges, normalizes, expands through a feed-forward block, applies GELU, then projects again. Only after that does it join the main stream by feature concatenation.
DeepStack says the final layer is not always the whole story. Useful vision information can live at multiple semantic depths, so Qwen3-VL taps it before it disappears.
| Branch stage | Purpose |
|---|---|
| tap | Select chosen encoder layers from body.mlp_residual.out. |
| branch_spatial_merge | Compress local spatial neighborhoods before projection. |
| branch_layernorm | Stabilize branch features before MLP processing. |
| branch_fc1 → branch_gelu → branch_fc2 | Nonlinearly transform branch features into a concatenation-ready space. |
| collect concat | Append branch information to the main projected vision tokens. |
Why this is a nice systems pattern
The branch is still described declaratively in the template. CKE does not hard-code a one-off multimodal hack somewhere deep in generated C. It states the tap, the producer ops, and the collection rule in the architecture file itself.
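To picture what the generated code has to do with that declaration, here is a hedged sketch of the tap step alone. The names `deepstack_slot` and `deepstack_capture` are hypothetical, the tap buffer layout is assumed, and any layer indices a caller passes in are stand-ins for whatever deepstack_layer_indices holds in the real configuration.

```c
#include <stddef.h>
#include <string.h>

/* Returns the branch slot of `layer` inside tap_layers, or -1 if the
 * layer is not tapped. */
int deepstack_slot(int layer, const int *tap_layers, int num_taps)
{
    for (int i = 0; i < num_taps; ++i) {
        if (tap_layers[i] == layer) {
            return i;
        }
    }
    return -1;
}

/* Intended to run right after a layer's second residual add. Copies
 * hidden[elems] into taps[slot * elems ...] when the layer is tapped.
 * Assumed layout: taps is [num_taps, elems]. Returns 1 if a copy
 * happened, 0 otherwise. */
int deepstack_capture(const float *hidden, size_t elems, int layer,
                      const int *tap_layers, int num_taps, float *taps)
{
    const int slot = deepstack_slot(layer, tap_layers, num_taps);
    if (slot < 0) {
        return 0;
    }
    memcpy(taps + (size_t)slot * elems, hidden, elems * sizeof(float));
    return 1;
}
```

In an unrolled encoder the membership test would disappear at code-generation time: tapped layers would simply contain the copy, and untapped layers would not. The runtime lookup here only keeps the sketch compact.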
Footer — From Vision Tokens to Language Space
After 27 layers of bidirectional visual reasoning, the encoder still holds patch-grid features in its own vision width. The footer is the conversion zone. It takes those features and turns them into the final vision_embeddings buffer consumed by the multimodal bridge.
The footer sequence is: final_norm → merge_main → projector_fc1 → projector_gelu → projector_fc2 → deepstack_concat. In words: normalize, spatially compress, project toward the decoder embedding space, and then concatenate the DeepStack branch features.

#include <string.h>  /* memcpy, size_t */

void spatial_merge_2x2(const float *input, float *output, int grid_h, int grid_w, int embed_dim)
{
const int out_h = grid_h / 2;
const int out_w = grid_w / 2;
for (int oh = 0; oh < out_h; ++oh) {
for (int ow = 0; ow < out_w; ++ow) {
float *dst = output + ((size_t)oh * out_w + ow) * embed_dim * 4;
// Pack: [top-left, top-right, bottom-left, bottom-right]
const float *tl = input + ((size_t)(2*oh) * grid_w + 2*ow) * embed_dim;
const float *tr = input + ((size_t)(2*oh) * grid_w + 2*ow + 1) * embed_dim;
const float *bl = input + ((size_t)(2*oh+1) * grid_w + 2*ow) * embed_dim;
const float *br = input + ((size_t)(2*oh+1) * grid_w + 2*ow + 1) * embed_dim;
memcpy(dst, tl, embed_dim * sizeof(float));
memcpy(dst + embed_dim, tr, embed_dim * sizeof(float));
memcpy(dst + 2*embed_dim, bl, embed_dim * sizeof(float));
memcpy(dst + 3*embed_dim, br, embed_dim * sizeof(float));
}
}
}
That packing order matters. Spatial merge does not average four neighboring patches into one smaller vector. It concatenates their embeddings in a fixed order: top-left, top-right, bottom-left, bottom-right. So the number of tokens drops by a factor of 4, while the local feature width grows by a factor of 4 before the projector MLP reshapes it again.
For a 32 × 32 grid, spatial merge turns 1024 patch tokens into 16 × 16 = 256 merged tokens. The projector then learns how to compress those wider local bundles into the decoder-facing representation.
| Footer op | Main effect |
|---|---|
| final_norm | Normalize the final per-patch features. |
| merge_main | Reduce token count by packing 2×2 neighborhoods. |
| projector_fc1 → projector_gelu → projector_fc2 | Project merged visual features into the multimodal bridge space. |
| deepstack_concat | Append branch features to form final vision_embeddings. |
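The last row, feature-axis concatenation, is simple enough to sketch directly. This is an illustration, not the branch_concat kernel: `concat_features` is a hypothetical name, and it assumes all branches share one width and that the main stream comes first in each output row.

```c
#include <stddef.h>
#include <string.h>

/* Concatenate main_feat[T, d_main] with num_branches branch outputs, each
 * [T, d_branch], along the feature axis. out must hold
 * T * (d_main + num_branches * d_branch) floats.
 * Assumptions: equal branch widths, main stream first in every row. */
void concat_features(const float *main_feat, int d_main,
                     const float *const *branches, int num_branches,
                     int d_branch, float *out, int T)
{
    const size_t d_out = (size_t)d_main + (size_t)num_branches * d_branch;
    for (int t = 0; t < T; ++t) {
        float *row = out + (size_t)t * d_out;
        memcpy(row, main_feat + (size_t)t * d_main, d_main * sizeof(float));
        for (int b = 0; b < num_branches; ++b) {
            memcpy(row + d_main + (size_t)b * d_branch,
                   branches[b] + (size_t)t * d_branch,
                   d_branch * sizeof(float));
        }
    }
}
```

The token count is untouched by this step. Only the per-token width grows, which is why the bridge later has to know the total projector output width and not just the decoder embedding width.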
The footer is the adapter from “vision transformer hidden state” to “language model prefix embedding.” The decoder does not want raw patch-grid internals. It wants a compact sequence of embeddings in its own multimodal interface space.
The Multimodal Bridge
At this point the vision encoder has emitted vision_embeddings. Those are not end-user words. They are already vector embeddings. The bridge logic in version/v8/scripts/vision_bridge_runtime_v8.py decides how they are handed to the text decoder.
The key parameters named in the source material are prefix_tokens, embed_dim, and projector_total_out_dim. The bridge resolves how many visual tokens exist after merging and projection, what their width is, and where they are written inside the decoder's input embedding buffer.
// Decoder embedded input buffer
float decoder_input[T_total][4096];
// Prefix region comes from vision encoder output
copy_vision_prefix(
decoder_input,
vision_embeddings,
prefix_tokens,
projector_total_out_dim);
// Text token embeddings are appended after the prefix
copy_text_embeddings(
decoder_input + prefix_tokens,
text_token_embeddings,
text_token_count);
// Decoder sees one unified sequence:
// [vision_tokens... | text_tokens...]
This is the multimodal trick in its cleanest form. The decoder does not need one attention mechanism for images and another for text. It just receives a single sequence of embeddings in which the first chunk came from the vision encoder. That is why the bridge matters so much. It translates between the vision template's output contract and the decoder template's input contract.
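A compilable version of the same idea might look like the sketch below. It is a simplification and not the logic of vision_bridge_runtime_v8.py: `assemble_decoder_input` is a hypothetical name, and it assumes each visual row has already been resolved to exactly embed_dim floats. How the real bridge maps projector_total_out_dim onto the decoder width is not shown here.

```c
#include <stddef.h>
#include <string.h>

/* Build decoder_input[(prefix_tokens + text_tokens), embed_dim] as
 * [vision rows | text rows]. Assumption: every vision row is already
 * embed_dim floats wide. Returns the total number of rows written. */
int assemble_decoder_input(float *decoder_input,
                           const float *vision_rows, int prefix_tokens,
                           const float *text_rows, int text_tokens,
                           int embed_dim)
{
    const size_t row_bytes = (size_t)embed_dim * sizeof(float);
    /* Vision prefix occupies rows 0 .. prefix_tokens - 1. */
    memcpy(decoder_input, vision_rows, (size_t)prefix_tokens * row_bytes);
    /* Text embeddings follow immediately after the prefix. */
    memcpy(decoder_input + (size_t)prefix_tokens * embed_dim,
           text_rows, (size_t)text_tokens * row_bytes);
    return prefix_tokens + text_tokens;
}
```

Because the prefix rows come first, a causal decoder treats them as ordinary past context for every text position that follows.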
The chat template in version/v8/templates/qwen3vl.json also exposes the interface at the text level using the markers <|vision_start|> and <|vision_end|>. Those markers are the symbolic shell around the visual prefix region.
Why the bridge is elegant
The encoder can stay fully bidirectional and image-specific. The decoder can stay causal and text-centric. The bridge is the handshake that lets each side keep its native contract.
The Text Decoder — What Changes for Multimodal
The decoder side is still a transformer language model. The generated file build/regression-runs/qwen3vl/multimodal_bridge/decoder/decoder_v8.c contains 36 unrolled layers with EMBED_DIM 4096 and NUM_HEADS 32. Its attention is causal for text generation. Its MLP uses the familiar Qwen-style gate_up → silu_mul → down SwiGLU pattern from the text template.
What changes in multimodal mode is not the decoder's identity. What changes is its input sequence. Instead of beginning with only learned token embeddings for text, it begins with a block of vision-derived embeddings that have already been projected into its embedding space.
| Aspect | Vision encoder | Text decoder |
|---|---|---|
| Layers | 27 | 36 |
| Embedding width | 1152 internal | 4096 |
| RoPE mode | 2D vision M-RoPE | 1D text M-RoPE |
| RoPE theta | 10000.0 | 5000000.0 |
| Attention visibility | Bidirectional | Causal over the running sequence |
The subtle but important decoder behavior is this: text tokens cannot look into their own future, but they can attend backward to the vision prefix because that prefix is already part of the past from the decoder's perspective. So multimodal generation is still causal, just with a richer past.
This is where the series-wide connection to Gated DeltaNet becomes helpful. DeltaNet showed that modern stacks can experiment with different memory contracts around the transformer core. Qwen3-VL's decoder remains dense causal attention, but it plugs into a vision prefix produced by a separate bidirectional encoder. Same broad systems theme: different stages can use different memory geometries if the interface is clean.
The decoder is still “just” a language model, except that its past now includes image-derived embeddings, so language reasoning can condition on visual context without a special second decoder.
SigLIP vs Qwen3-VL Vision — Two Approaches
The source material gives us a useful comparison point in version/v8/templates/siglip_vit.json. SigLIP ViT uses the same general header/body/footer template pattern, but the design is simpler. It uses an im2patch frontend, a single projection path, RMSNorm, and GeGLU. Qwen3-VL uses dual patch projection, LayerNorm, GELU, 2D M-RoPE, and DeepStack branches.
That comparison is valuable because it shows the template language is not specific to one architecture. CKE v8 can describe both a cleaner SigLIP-style ViT and the richer Qwen3-VL vision frontend with branches. Same structural grammar. Different model family decisions.
| Feature | SigLIP ViT | Qwen3-VL vision |
|---|---|---|
| Template pattern | Header → body → footer | Header → body → footer |
| Patch frontend | Single im2patch style projection | Dual patch projection with summed auxiliary stream |
| Normalization | RMSNorm | LayerNorm |
| Activation | GeGLU | GELU |
| Positional mechanism | Simpler ViT-style handling | 2D M-RoPE with four streams [y,x,y,x] |
| Branches | No DeepStack branch in the source description | Fixed DeepStack branch with concat collection |
SigLIP shows the base template idea. Qwen3-VL shows how far that same template idea can be pushed when you want richer spatial handling and multi-depth feature export. The architecture language stays stable while the blocks get more ambitious.
Series callback
LayerNorm And RMSNorm contrasted LayerNorm and RMSNorm as systems choices, not merely formula choices. This comparison makes that point concrete again. Different models choose different normalization contracts even when the high-level transformer skeleton is similar.
Memory and Compute Profile
Let us make the buffer story concrete. The raw image is [3, 448, 448] or 602,112 floats. Patchify preserves that total float count but reorganizes it into [1024, 588]. After either projection path, the header is working with [1024, 1152] activations, which is 1,179,648 floats.
At FP32, one such activation matrix takes 1,179,648 × 4 = 4,718,592 bytes, or about 4.5 MB when 1 MB is counted as 2^20 bytes. That is only one stage. Internally, each layer also creates Q, K, and V views across 16 heads of width 72. Because the encoder is bidirectional and non-autoregressive, those attention intermediates are ephemeral rather than stored as a growing decode cache.

| Stage | Shape | Float count | Approx FP32 bytes |
|---|---|---|---|
| Image | [3, 448, 448] | 602,112 | ~2.3 MB |
| Patches | [1024, 588] | 602,112 | ~2.3 MB |
| Projected patch stream | [1024, 1152] | 1,179,648 | ~4.5 MB |
| After 2×2 merge | [256, 4×1152] before projector reshaping | 1,179,648 packed features | same order, different layout |
| Decoder prefix | [prefix_tokens, projector_total_out_dim] | runtime-specific | bridge-managed |
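The float counts and byte sizes in the table follow from two multiplications, sketched below. The helper names are illustrative, and the megabyte figures are computed with 1 MB taken as 2^20 bytes, which is the convention that reproduces the table's rounded values.

```c
#include <stddef.h>

/* Number of floats in a [rows, cols] activation matrix. */
size_t activation_floats(size_t rows, size_t cols)
{
    return rows * cols;
}

/* FP32 footprint in bytes (sizeof(float) is 4 on common platforms). */
size_t fp32_bytes(size_t floats)
{
    return floats * sizeof(float);
}

/* Bytes expressed in units of 2^20 bytes. */
double to_mib(size_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

For example, `to_mib(fp32_bytes(activation_floats(1024, 1152)))` evaluates to 4.5, and the same count of 1,179,648 floats reappears for the merged [256, 4 × 1152] layout, which confirms that spatial merge rearranges data without shrinking it.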
The provided generated encoder weight bundle is about 752 MB. That is the parameter story. The activation story is smaller but still nontrivial because the encoder moves fairly wide patch-token matrices through 27 layers.
The encouraging systems fact is that spatial merge reduces the sequence length before the decoder sees anything. Instead of handing 1024 raw patch tokens to the language model, the footer compresses the representation into a much shorter visual prefix.
752 MB: The generated encoder artifact size in the source material highlights that modern multimodal quality is not just about clever kernels; it is also about carrying a very large learned visual parameter bank.
No autoregressive KV cache on the vision side
This is worth stating explicitly. The encoder does attention over the full image in one pass. It does not need the growing, persistent KV-cache that dominates long-context decoder inference. That is one reason the visual prefix can be expensive up front but still clean to hand off afterward.
Conclusion — Everything Connects
The biggest takeaway from Qwen3-VL in C-Kernel-Engine v8 is not merely that a strong open model uses patchify, RoPE, and attention. It is that the entire path is declared in a structure we can read: header, body, footer.
Header turns pixels into positioned patch tokens. Body applies the same transformer ideas we derived earlier in the series: LayerNorm, packed QKV projection, attention, output projection, GELU MLP, and residual addition. Footer compresses and projects those features into a decoder-facing embedding sequence. Then the multimodal bridge prepends that sequence to text embeddings so one causal decoder can reason over both image and language.
The header/body/footer pattern we have been building toward in this series is not an analogy. In C-Kernel-Engine v8, it is the actual architecture template that generates the Qwen3-VL vision encoder. That is why this post matters: it shows theory meeting software architecture line by line.
Attention is here. GELU and activation functions are here. Normalization is here. Residual connections are here. Training and backprop logic is what made these weights learnable in the first place. The optimizer story is what moved those parameters into a useful regime. And Gated DeltaNet reminds us that attention is not the only memory design in modern model systems, even if this specific vision encoder is a pure ViT.
When you can point at a real template, a real kernel, and a real generated C file, “ML fundamentals” stops being abstract. It becomes inspectable engineering.
Further reading and follow-up
C-Kernel-Engine documentation: https://c-kernel-engine.github.io/C-Kernel-Engine/
If you want the narrated version of this ML fundamentals series, follow ShivasNotes on YouTube at youtube.com/@shivasnotes.
This post is the bridge from isolated ideas to a real multimodal stack. The next time you hear “vision-language model,” you should be able to picture the exact sequence: patchify, project, position, attend, branch, merge, project again, then prepend to text.