This page implements a modern Qwen3 forward pass by hand — RMSNorm, rotary embeddings, grouped-query attention with QK-normalisation, SwiGLU — and runs it on your GPU. Because we own the loop, we can read the prediction at every one of the 28 layers (the logit lens) and genuinely switch a layer off to watch the output change.
1 Load the model
Weights are fetched from Hugging Face once (then cached) and everything runs locally on your machine.
Heads-up: this is a real 0.6-billion-parameter model. First load downloads ~1.2 GB (BF16) and holds ~2.4 GB in memory once expanded to FP32. Best on a desktop with a discrete GPU and a recent Chrome/Edge; lower-end devices may run out of memory — if so, switch the backend to CPU.
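The two sizes quoted above follow from bytes-per-parameter arithmetic. Here is a minimal sketch, assuming the rounded 0.6-billion parameter count and decimal gigabytes; `weight_gb` is a hypothetical helper, not part of the page's engine, and the figures cover the weights only (activations and any key/value cache come on top).

```python
# Back-of-envelope weight memory, using the rounded 0.6-billion figure quoted above.
PARAMS = 0.6e9  # approximate parameter count; the exact count differs slightly

BYTES_PER_PARAM = {"bf16": 2, "fp32": 4}


def weight_gb(params: float, dtype: str) -> float:
    """Size of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


download_gb = weight_gb(PARAMS, "bf16")  # what gets downloaded
resident_gb = weight_gb(PARAMS, "fp32")  # what is held once expanded to FP32

print(f"download ~{download_gb:.1f} GB, in memory ~{resident_gb:.1f} GB")
```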
2 The prompt
3 The logit lens
Prediction read at each of 28 layers · click a layer to ablate it
This is real. Each row shows the model’s best next-token guess if you decoded at that depth — watch the prediction emerge as it climbs the stack. The toggle genuinely removes that layer’s attention+MLP from the forward pass (the residual stream passes straight through), so switching layers off really changes the bottom row. Re-run after toggling.
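The idea behind the lens and the toggle can be sketched on a toy model. This is an illustration under loud assumptions, not the page's engine: the "layers" are stand-in functions that return a fixed update, the hidden state has two dimensions, the vocabulary has two tokens, and the final normalisation a real model applies before unembedding is left out. `logit_lens` and `unembed` are hypothetical names.

```python
from typing import Callable, FrozenSet, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]  # returns the update added to the residual stream


def unembed(x: Vector, W: Sequence[Vector]) -> Vector:
    """Project a hidden state to vocabulary logits (one row of W per token)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def logit_lens(
    x0: Vector,
    layers: Sequence[Layer],
    W: Sequence[Vector],
    ablated: FrozenSet[int] = frozenset(),
) -> List[int]:
    """Return the best-guess token id as read after every layer.

    An ablated layer contributes nothing, so the residual stream
    passes straight through it unchanged.
    """
    x = list(x0)
    guesses = []
    for i, layer in enumerate(layers):
        if i not in ablated:
            update = layer(x)  # stand-in for the attention + MLP updates
            x = [a + b for a, b in zip(x, update)]
        logits = unembed(x, W)
        guesses.append(max(range(len(logits)), key=logits.__getitem__))
    return guesses


# Toy setup: token 0 starts ahead, the layers gradually push towards token 1.
W = [[1.0, 0.0], [0.0, 1.0]]
x0 = [1.0, 0.0]
layers = [
    lambda x: [0.0, 0.5],
    lambda x: [0.0, 1.0],
    lambda x: [0.0, 0.25],
]

full = logit_lens(x0, layers, W)
without_layer_1 = logit_lens(x0, layers, W, ablated=frozenset({1}))
print(full, without_layer_1)
```

In this toy, layer 1 carries most of the push towards token 1, so ablating it changes the guess at every later depth, including the last one.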
4 Top-5 next-word log-odds
One block per generated token, printed as it goes
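A top-5 readout of this kind can be sketched as follows. It assumes the final-layer logits are turned into log-probabilities with a numerically stable log-softmax and then ranked; how the page itself defines "log-odds" is not stated here, so the conversion in `to_log_odds` (log p minus log(1 − p)) is one plausible reading rather than the page's confirmed method. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def log_softmax(logits: List[float]) -> List[float]:
    """Log-probabilities, subtracting the max first to avoid overflow."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def to_log_odds(logp: float) -> float:
    """log(p / (1 - p)) from log p; assumes p < 1."""
    return logp - math.log1p(-math.exp(logp))


def top_k(logits: List[float], k: int = 5) -> List[Tuple[int, float]]:
    """Return (token_id, log_probability) for the k most likely tokens."""
    logp = log_softmax(logits)
    order = sorted(range(len(logp)), key=lambda i: logp[i], reverse=True)
    return [(i, logp[i]) for i in order[:k]]


# Made-up logits over a six-token vocabulary, purely for illustration.
example_logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.0]
for token_id, logp in top_k(example_logits):
    print(token_id, round(logp, 3), round(to_log_odds(logp), 3))
```

Because log-odds are a monotonic function of probability, ranking by either quantity gives the same top five.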
Sibling page: a packaged Gemma-3-270M running via transformers.js — real model, but its compiled graph can’t expose layers, which is exactly why this hand-written engine exists.
Every part of Qwen3’s modern stack is implemented from scratch here: RMSNorm, rotary position embeddings (θ=1,000,000), grouped-query attention (16 query heads sharing 8 key/value heads), per-head QK-normalisation, and a SwiGLU feed-forward. Checking that a from-scratch port reproduces the real model is the whole point, and it is why this page uses Qwen3 rather than a model with sliding-window attention or mixture-of-experts layers, which are far harder to reproduce faithfully.
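The building blocks named above are small enough to sketch on plain lists. This is a minimal illustration, not the page's GPU code. The assumptions: `eps=1e-6` for RMSNorm; the rotary pairing of dimension i with i + d/2 (the half-split convention; some codebases interleave adjacent pairs instead); and all helper names are hypothetical. Only θ and the 16/8 head counts come from the text.

```python
import math
from typing import List

ROPE_THETA = 1_000_000.0       # rotary base quoted above
N_Q_HEADS, N_KV_HEADS = 16, 8  # head counts quoted above


def rms_norm(x: List[float], gain: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x to unit root-mean-square, then apply a learned per-dimension gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)  # eps value is an assumption
    return [v * scale * g for v, g in zip(x, gain)]


def rope(x: List[float], pos: int, theta: float = ROPE_THETA) -> List[float]:
    """Rotate one head vector by position-dependent angles.

    Assumes the half-split layout: dimension i is paired with i + d/2.
    """
    d = len(x)
    half = d // 2
    out = list(x)
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def kv_head_for(q_head: int) -> int:
    """Grouped-query attention: consecutive query heads share one K/V head."""
    group_size = N_Q_HEADS // N_KV_HEADS  # 2 query heads per K/V head
    return q_head // group_size


def swiglu(gate: List[float], up: List[float]) -> List[float]:
    """Elementwise SiLU(gate) * up, the gated core of a SwiGLU feed-forward."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


# QK-normalisation, as sketched here: normalise each head's query (and key)
# vector with RMSNorm before applying the rotary embedding.
q_head_vec = [3.0, 4.0, 0.0, 0.0]
q = rope(rms_norm(q_head_vec, [1.0] * 4), pos=7)
print(q, kv_head_for(5))
```

The rotation only changes direction, never length, which is why RoPE can be applied after the per-head normalisation without undoing it.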