A transformer, layer by layer.

This page implements a modern Qwen3 forward pass by hand — RMSNorm, rotary embeddings, grouped-query attention with QK-normalisation, SwiGLU — and runs it on your GPU. Because we own the loop, we can read the prediction at every one of the 28 layers (the logit lens) and genuinely switch a layer off to watch the output change.
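To make "owning the loop" concrete, here is a minimal sketch of the logit lens and layer ablation on a toy residual stream. It is not the page's engine: the names `logitLens`, `unembed` and the two-layer demo model are hypothetical, and the final normalisation that a real model applies before the unembedding is left out for brevity.

```typescript
// Toy residual-stream model (hypothetical names, not the page's actual code).
type Vec = number[];
type Layer = (h: Vec) => Vec; // returns the residual update, not the new state

const add = (a: Vec, b: Vec): Vec => a.map((x, i) => x + b[i]);

// Project a hidden state onto the vocabulary: one dot product per token row.
const unembed = (h: Vec, W: Vec[]): Vec =>
  W.map((row) => row.reduce((s, w, i) => s + w * h[i], 0));

const argmax = (v: Vec): number =>
  v.reduce((best, x, i) => (x > v[best] ? i : best), 0);

// Run the layers by hand. `ablate` is the index of a layer to skip (-1 = none).
// Returns the top token read off the residual stream after every layer.
function logitLens(h0: Vec, layers: Layer[], W: Vec[], ablate = -1): number[] {
  let h = h0;
  const perLayerTop: number[] = [];
  layers.forEach((layer, i) => {
    if (i !== ablate) h = add(h, layer(h)); // an ablated layer contributes nothing
    perLayerTop.push(argmax(unembed(h, W)));
  });
  return perLayerTop;
}

// Two-token vocabulary, two layers: layer 0 pushes towards token 1,
// layer 1 pushes back towards token 0.
const W: Vec[] = [[1, 0], [0, 1]];
const layers: Layer[] = [() => [0, 2], () => [3, 0]];

console.log(logitLens([1, 0], layers, W));    // [1, 0]
console.log(logitLens([1, 0], layers, W, 1)); // [1, 1]  (layer 1 switched off)
```

Because the ablated layer is skipped inside the loop rather than masked afterwards, every later layer sees the changed residual stream, which is what makes the intervention a genuine one.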

1 Load the model

Weights are fetched from Hugging Face once (then cached), and everything runs locally on your machine.

Heads-up: this is a real 0.6-billion-parameter model. First load downloads ~1.2 GB (BF16) and holds ~2.4 GB in memory once expanded to FP32. Best on a desktop with a discrete GPU and a recent Chrome/Edge; lower-end devices may run out of memory — if so, switch the backend to CPU.
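The two sizes follow from bytes per parameter: BF16 stores 2 bytes per weight and FP32 stores 4. The sketch below shows the arithmetic and one common way to widen BF16 to FP32, relying on the fact that a BF16 value is the upper 16 bits of the corresponding FP32 bit pattern. The function names are illustrative, and the parameter count is the nominal 0.6 billion quoted above rather than an exact figure.

```typescript
// Approximate footprint in GB for a given parameter count and element width.
const gigabytes = (params: number, bytesPerParam: number): number =>
  (params * bytesPerParam) / 1e9;

const NOMINAL_PARAMS = 0.6e9; // nominal figure from the text, not an exact count
console.log(gigabytes(NOMINAL_PARAMS, 2)); // BF16 download size
console.log(gigabytes(NOMINAL_PARAMS, 4)); // FP32 size once expanded

// Widen raw BF16 values to FP32: place the 16 bits in the top half of a
// 32-bit word and leave the low 16 mantissa bits at zero.
function bf16ToF32(src: Uint16Array): Float32Array {
  const bits = new Uint32Array(src.length);
  for (let i = 0; i < src.length; i++) bits[i] = src[i] << 16;
  return new Float32Array(bits.buffer);
}

// 0x3f80 is 1.0 and 0xc000 is -2.0 in BF16.
console.log(bf16ToF32(new Uint16Array([0x3f80, 0xc000])));
```

The conversion is lossless in this direction, so expanding costs memory but not accuracy.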

2 The prompt

generate 8 tokens

4 Top-5 next-word log-odds · one block per generated token, printed as it goes

Load the model and run a forward pass.
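A top-5 readout can be computed from a logit vector along these lines. This is a sketch under an assumption: it reports log-probabilities (a log-softmax), since the page does not say whether its figures are raw logits or normalised values. The name `topK` is hypothetical.

```typescript
interface Scored {
  token: number;   // vocabulary index
  logProb: number; // log-softmax value for that token
}

// Return the k highest-scoring tokens with their log-probabilities.
function topK(logits: number[], k = 5): Scored[] {
  // Subtract the maximum before exponentiating so large logits cannot overflow.
  // A reduce is used instead of Math.max(...logits) because spreading a
  // vocabulary-sized array can exceed the engine's argument limit.
  const max = logits.reduce((m, v) => (v > m ? v : m), -Infinity);
  const logZ = max + Math.log(logits.reduce((s, v) => s + Math.exp(v - max), 0));
  return logits
    .map((v, token) => ({ token, logProb: v - logZ }))
    .sort((a, b) => b.logProb - a.logProb)
    .slice(0, k);
}

// Two tokens with probabilities 0.25 and 0.75.
console.log(topK([0, Math.log(3)], 2));
```

Sorting the whole vocabulary is the simplest approach; a partial selection would be cheaper for large vocabularies but changes nothing in the result.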

5 Notes & references

Weights: Qwen/Qwen3-0.6B-Base (Apache-2.0, BF16 safetensors, fetched & parsed in-browser) · tokenizer via transformers.js · maths on TensorFlow.js (WebGPU backend).
The logit lens: Interpreting GPT: the logit lens (nostalgebraist, 2020).
Sibling page: a packaged Gemma-3-270M running via transformers.js — real model, but its compiled graph can’t expose layers, which is exactly why this hand-written engine exists.
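For readers curious what parsing safetensors in the browser involves, the sketch below reads the file header. It relies on the published safetensors layout: an 8-byte little-endian header length, a UTF-8 JSON header mapping tensor names to `dtype`, `shape` and `data_offsets`, then the raw tensor bytes, with offsets measured from the end of the header. The function names are illustrative and not taken from the page's source.

```typescript
interface TensorInfo {
  dtype: string;                  // e.g. "BF16"
  shape: number[];
  data_offsets: [number, number]; // [begin, end) relative to the data section
}

interface ParsedHeader {
  header: Record<string, TensorInfo>;
  dataStart: number; // byte offset where tensor data begins
}

function parseSafetensorsHeader(buf: ArrayBuffer): ParsedHeader {
  const view = new DataView(buf);
  // 64-bit little-endian length, read as two 32-bit halves.
  const low = view.getUint32(0, true);
  const high = view.getUint32(4, true);
  const headerLen = high * 2 ** 32 + low;
  const json = new TextDecoder().decode(new Uint8Array(buf, 8, headerLen));
  const header = JSON.parse(json) as Record<string, TensorInfo>;
  return { header, dataStart: 8 + headerLen };
}

// View one tensor's raw bytes without copying.
function tensorBytes(buf: ArrayBuffer, parsed: ParsedHeader, name: string): Uint8Array {
  const [begin, end] = parsed.header[name].data_offsets;
  return new Uint8Array(buf, parsed.dataStart + begin, end - begin);
}
```

A real file may also carry a `__metadata__` entry in the header, which a loader would skip when iterating over tensor names.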

Every part of Qwen3’s modern stack is implemented from scratch here: RMSNorm, rotary position embeddings (θ=1,000,000), grouped-query attention (16 query heads sharing 8 key/value heads), per-head QK-normalisation, and a SwiGLU feed-forward. Reproducing a real model in a from-scratch port is the whole point, and it is why this page uses Qwen3 rather than a model with sliding-window attention or mixture-of-experts feed-forward layers, which are far harder to reproduce faithfully.
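The building blocks named above are each only a few lines of arithmetic. The sketch below works on plain arrays for a single vector and is not the page's TensorFlow.js code. Three things are assumptions: the epsilon of 1e-6, the "rotate-half" channel pairing for the rotary embedding, and the mapping of consecutive query heads onto one key/value head. The projection matrix multiplies of the feed-forward are omitted.

```typescript
type Vec = number[];

// RMSNorm: scale x to unit root-mean-square, then apply a per-channel gain.
// Per-head QK-normalisation applies the same operation to each head's
// query and key vectors.
function rmsNorm(x: Vec, gain: Vec, eps = 1e-6): Vec {
  const meanSq = x.reduce((s, v) => s + v * v, 0) / x.length;
  const inv = 1 / Math.sqrt(meanSq + eps);
  return x.map((v, i) => v * inv * gain[i]);
}

// Rotary embedding for one head vector at position `pos`.
// Assumption: channel i is paired with channel i + d/2 ("rotate-half" layout).
function rope(x: Vec, pos: number, theta = 1e6): Vec {
  const half = x.length / 2;
  const out = new Array<number>(x.length);
  for (let i = 0; i < half; i++) {
    const angle = pos * Math.pow(theta, (-2 * i) / x.length);
    const c = Math.cos(angle);
    const s = Math.sin(angle);
    out[i] = x[i] * c - x[i + half] * s;
    out[i + half] = x[i + half] * c + x[i] * s;
  }
  return out;
}

// Grouped-query attention: with 16 query heads and 8 key/value heads,
// each pair of consecutive query heads reads the same key/value head.
function kvHeadFor(qHead: number, nQ = 16, nKV = 8): number {
  return Math.floor(qHead / (nQ / nKV));
}

// SwiGLU gating: silu(gate) * up, elementwise.
const silu = (v: number): number => v / (1 + Math.exp(-v));
const swiglu = (gate: Vec, up: Vec): Vec => gate.map((g, i) => silu(g) * up[i]);
```

Rotation leaves a vector's length unchanged, so the rotary step alters only the relative angle between queries and keys at different positions.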


3 The logit lens · prediction read at each of 28 layers · click a layer to ablate it
