The Light Between Us · an explainer

THE FORWARD PASS

a real transformer, layer by layer, with the actual matrices on the right
trace 0 of 12 · the sentence
Everything below is computed live on this page: a tiny but genuine transformer. Vocabulary of 8 words, model width d = 8, 2 heads, 2 layers, weights fixed by seed.
The input sequence:
The model's entire job: read these tokens and predict the next word. Scroll. The network beside you lights up, stage by stage, as the signal moves through it.
trace 1 of 12 · tokens become vectors
Each token looks up its row in the embedding table W_E (8 words × 8 dims):
X = W_E[tokens]
X : (T × 8)
Words are now points in 8-dimensional space. Nothing about order yet: "us between light the" would give the same rows in a different order.
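The lookup can be sketched in a few lines of plain Python. The vocabulary and the seed below are hypothetical stand-ins (the page's own word list and weights are not reproduced here); only the sizes, 8 words by 8 dims, come from the text.

```python
import random

# Hypothetical 8-word vocabulary; the page's actual word list is an assumption here.
VOCAB = ["the", "light", "between", "us", "is", "a", "wave", "shines"]
D = 8  # model width, as stated on the page

rng = random.Random(0)  # arbitrary seed, not the page's seed
# W_E: one row of D numbers per vocabulary word (8 x 8).
W_E = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in VOCAB]

def embed(words):
    """Look up each word's row in W_E, giving X with shape (T x D)."""
    ids = [VOCAB.index(w) for w in words]
    return [W_E[i] for i in ids]

X = embed(["the", "light", "between", "us"])
X_shuffled = embed(["us", "between", "light", "the"])
```

Reversing the words only reverses the rows: `X_shuffled` is `X[::-1]`, which is the sense in which the embeddings know nothing about order.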
trace 2 of 12 · position is added
Attention is permutation-blind, so we stamp each row with where it sits, using sine and cosine waves of different frequencies:
P[t,2i] = sin(t / 10000^(2i/d))
P[t,2i+1] = cos(t / 10000^(2i/d))
X ← X + P
Watch the embedding nodes ring as the waves are stamped in: that ripple is word order.
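As a rough sketch, the two formulas translate directly into a double loop. The function names `positional_encoding` and `add_positions` are illustrative, not taken from the page.

```python
import math

def positional_encoding(T, d):
    """P[t][2i] = sin(t / 10000^(2i/d)), P[t][2i+1] = cos of the same angle."""
    P = [[0.0] * d for _ in range(T)]
    for t in range(T):
        for i in range(d // 2):
            angle = t / (10000 ** (2 * i / d))
            P[t][2 * i] = math.sin(angle)
            P[t][2 * i + 1] = math.cos(angle)
    return P

def add_positions(X):
    """X <- X + P, elementwise, so every row carries its position."""
    P = positional_encoding(len(X), len(X[0]))
    return [[x + p for x, p in zip(xr, pr)] for xr, pr in zip(X, P)]

P = positional_encoding(4, 8)  # T = 4 tokens, d = 8
```

Row 0 is always the pattern 0, 1, 0, 1, ... (sin 0 and cos 0); later rows rotate away from it at a different speed in each pair of columns.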
trace 3 of 12 · three questions per word
Layer 1 projects every row of X three ways with learned matrices:
Q = X·W_Q    "what am I looking for?"
K = X·W_K    "what do I contain?"
V = X·W_V    "what do I pass on if chosen?"
With 2 heads, each head gets half the width: Q, K, V are (T × 4) per head. The panel shows head 1's Q.
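A minimal sketch of the three projections and the head split, assuming the common arrangement in which each head takes a contiguous block of columns. The random weights and the helper names (`matmul`, `split_heads`) are placeholders for illustration.

```python
import random

def matmul(A, B):
    """Plain (n x k) times (k x m) matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def split_heads(M, n_heads):
    """Cut a (T x d) matrix into n_heads column blocks of shape (T x d/n_heads)."""
    d_k = len(M[0]) // n_heads
    return [[row[h * d_k:(h + 1) * d_k] for row in M] for h in range(n_heads)]

rng = random.Random(0)          # arbitrary seed, for illustration only
T, D, HEADS = 4, 8, 2
X = random_matrix(T, D, rng)    # stand-in for the position-stamped embeddings
W_Q, W_K, W_V = (random_matrix(D, D, rng) for _ in range(3))

Q = split_heads(matmul(X, W_Q), HEADS)  # Q[0] is head 1's queries, shape (T x 4)
K = split_heads(matmul(X, W_K), HEADS)
V = split_heads(matmul(X, W_V), HEADS)
```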
trace 4 of 12 · every word scores every word
Each query is dotted against every key, and scaled so the softmax doesn't saturate:
S = Q·Kᵀ / √d_k
S : (T × T)
This is the famous quadratic cost: T words means T² scores. The panel now wires every token to every token: each edge is one score, thickness by strength.
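For one head, the score matrix can be sketched as below. The numbers are hand-picked so the result is easy to check by eye; they are not the page's values.

```python
import math

def attention_scores(Q, K):
    """S = Q K^T / sqrt(d_k): one dot product per (query, key) pair."""
    d_k = len(Q[0])
    return [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
             for k_row in K] for q_row in Q]

# Toy head with T = 3 tokens and d_k = 4, so the scale factor is 1/2.
Q = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
K = [[2.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
S = attention_scores(Q, K)
# S == [[1.0, 0.0, 0.5], [0.0, 4.0, 1.0], [1.0, 2.0, 2.0]]
```

Three tokens give nine scores; doubling the sequence length would quadruple that count.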
trace 5 of 12 · the future is forbidden
We are predicting the next word, so position i must not peek at positions j > i. Those scores are set to −∞, then each row is softmaxed:
S[i,j] = −∞ for j > i
A = softmax(S)    rows sum to 1
The edges pointing at the future are cut: the dashed ones on the right are the forbidden connections. What remains is the attention pattern: who listens to whom.
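Masking and the row-wise softmax fit in one small function. This sketch subtracts the row maximum before exponentiating, a standard numerical-stability step that the page's formula does not mention but that leaves the result unchanged.

```python
import math

def causal_softmax(S):
    """Set S[i][j] to -inf for j > i, then softmax each row."""
    A = []
    for i, row in enumerate(S):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        top = max(masked)                        # finite: j == i is always allowed
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) is exactly 0.0
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

S = [[1.0, 9.0, 9.0], [0.0, 0.0, 9.0], [1.0, 2.0, 3.0]]
A = causal_softmax(S)
# A[0] == [1.0, 0.0, 0.0] and A[1] == [0.5, 0.5, 0.0]
```

The large scores of 9.0 sit in the future positions and vanish entirely: the first word can only attend to itself, however tempting the later keys look.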
trace 6 of 12 · the weighted mix
Each word's new representation is the attention-weighted blend of everyone's values, heads glued back together and mixed once more:
head_h = A_h·V_h
Z = concat(head_1, head_2)·W_O
Z : (T × 8)
This is the only place words exchange information. Everything else happens to each word alone.
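A small sketch of the blend, shrunk to T = 2 tokens and two heads of width 2 so the arithmetic can be followed by hand. W_O is set to the identity here purely to keep the output readable; in the real model it is a learned matrix.

```python
def matmul(A, B):
    """Plain matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mix(A_heads, V_heads, W_O):
    """head_h = A_h V_h, then Z = concat(heads) W_O."""
    heads = [matmul(A, V) for A, V in zip(A_heads, V_heads)]
    # Glue the heads back together side by side, one row per token.
    concat = [sum((h[t] for h in heads), []) for t in range(len(heads[0]))]
    return matmul(concat, W_O)

A = [[1.0, 0.0], [0.5, 0.5]]           # causal attention weights, same for both heads
V1 = [[2.0, 0.0], [0.0, 2.0]]          # head 1 values
V2 = [[4.0, 4.0], [0.0, 0.0]]          # head 2 values
I4 = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
Z = mix([A, A], [V1, V2], I4)
# Z == [[2.0, 0.0, 4.0, 4.0], [1.0, 1.0, 2.0, 2.0]]
```

The second row is the average of both tokens' values, because the second word splits its attention evenly between itself and the first.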
trace 7 of 12 · residual + layernorm
The block's output is added to its input, then each row is normalized to zero mean and unit variance:
X ← LayerNorm(X + Z)
The addition is the skip connection: a gradient superhighway. Even a hundred layers deep, the original signal is one hop away.
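The add-then-normalize step can be sketched as follows. The learned scale and shift that LayerNorm usually carries are left out, and the small `eps` is an assumed value guarding against division by zero.

```python
import math

def layer_norm(row, eps=1e-5):
    """Shift a row to zero mean and scale it to (almost exactly) unit variance."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def add_and_norm(X, Z):
    """X <- LayerNorm(X + Z), applied to each row independently."""
    return [layer_norm([x + z for x, z in zip(xr, zr)]) for xr, zr in zip(X, Z)]

X = [[1.0, 2.0, 3.0, 4.0]]
Z = [[0.0, 0.0, 0.0, 4.0]]
out = add_and_norm(X, Z)
```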
trace 8 of 12 · the feed-forward expansion
Now each word, alone, is pushed through a two-layer MLP that expands to 16 dims, cuts the negatives, and comes back:
H = relu(X·W_1)
H : (T × 16)
X ← LayerNorm(X + H·W_2)
The panel fans each token through the hidden units: bright dots survived relu, dark ones were cut to zero. In a real model, where the hidden layer is typically 4× the model width, this is where most of the parameters live: about two-thirds of each block's weights.
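Here is a hedged sketch of the feed-forward step with the page's sizes (8 to 16 and back to 8). The weights are random placeholders, biases are omitted as in the formula above, and LayerNorm is written without a learned scale or shift.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def feed_forward(X, W_1, W_2):
    """H = relu(X W_1); returns H and LayerNorm(X + H W_2)."""
    H = [[max(0.0, h) for h in row] for row in matmul(X, W_1)]
    out = matmul(H, W_2)
    X_new = [layer_norm([x + o for x, o in zip(xr, orow)]) for xr, orow in zip(X, out)]
    return H, X_new

rng = random.Random(0)  # arbitrary seed, for illustration only
X = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
W_1 = [[rng.gauss(0.0, 0.5) for _ in range(16)] for _ in range(8)]
W_2 = [[rng.gauss(0.0, 0.5) for _ in range(8)] for _ in range(16)]
H, X_new = feed_forward(X, W_1, W_2)
```

Feeding a single row through `feed_forward` gives the same result as that row gets inside the full batch, which is what "each word, alone" means in practice.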
trace 9 of 12 · layer 2 runs the same dance
That whole ritual (attend, add, normalize, expand, add, normalize) is one layer. The output feeds straight into layer 2, which does it all again with its own weights.
for layer in 1..N: X = Block_layer(X)
Here N = 2. In a frontier model N can be 80 or more and d is in the thousands, but it is this exact loop. The panel lights the second block: same wiring, its own weights.
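Put together, one block and the loop over layers might look like the sketch below. It is a compact approximation under stated assumptions: random weights from an arbitrary seed, no biases, no learned LayerNorm parameters, and the causal mask implemented by only looping over positions j ≤ i (equivalent to setting the future scores to −∞). The dictionary keys and function names are illustrative.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add_norm(X, Y, eps=1e-5):
    """LayerNorm(X + Y), row by row."""
    out = []
    for xr, yr in zip(X, Y):
        row = [x + y for x, y in zip(xr, yr)]
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def attention(X, W, n_heads):
    """Causal multi-head self-attention; position i only sees j <= i."""
    Q, K, V = (matmul(X, W[name]) for name in ("Q", "K", "V"))
    d_k = len(X[0]) // n_heads
    Z = [[] for _ in X]
    for h in range(n_heads):
        lo, hi = h * d_k, (h + 1) * d_k
        for i in range(len(X)):
            scores = [sum(q * k for q, k in zip(Q[i][lo:hi], K[j][lo:hi])) / math.sqrt(d_k)
                      for j in range(i + 1)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            # Append this head's slice to row i: heads end up concatenated in order.
            Z[i] += [sum(e / total * V[j][c] for j, e in enumerate(exps))
                     for c in range(lo, hi)]
    return matmul(Z, W["O"])

def block(X, W, n_heads=2):
    """Attend, add, normalize, expand, add, normalize."""
    X = add_norm(X, attention(X, W, n_heads))
    H = [[max(0.0, h) for h in row] for row in matmul(X, W["1"])]
    return add_norm(X, matmul(H, W["2"]))

def init_layer(rng, d=8, d_ff=16):
    shapes = {"Q": (d, d), "K": (d, d), "V": (d, d), "O": (d, d),
              "1": (d, d_ff), "2": (d_ff, d)}
    return {name: [[rng.gauss(0.0, 0.5) for _ in range(c)] for _ in range(r)]
            for name, (r, c) in shapes.items()}

rng = random.Random(0)                          # arbitrary seed
layers = [init_layer(rng) for _ in range(2)]    # N = 2, each with its own weights
X0 = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
X = X0
for W in layers:
    X = block(X, W)
```

Because of the causal mask, the first row of the output depends only on the first token: running the same layers on `X0[:1]` reproduces `X[0]`.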
trace 10 of 12 · the prediction
Only the last word's vector is used to predict. It is compared against every word in the vocabulary (the embedding table, reused: weight tying):
logits = x_last·W_Eᵀ    (8 scores)
p = softmax(logits)
The brightest edge into the output column is the model's chosen next word.
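Weight tying means the prediction is just a dot product of the last vector with every row of the embedding table. A toy sketch with a 3-word, 2-dim table (hand-picked numbers, not the page's 8 × 8 weights):

```python
import math

def predict_next(x_last, W_E):
    """logits = x_last . W_E^T, then softmax over the vocabulary."""
    logits = [sum(x * w for x, w in zip(x_last, row)) for row in W_E]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return logits, [e / total for e in exps]

W_E = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # toy embedding table, one row per word
logits, p = predict_next([2.0, 1.0], W_E)
best = p.index(max(p))                         # index of the most probable next word
# logits == [2.0, 1.0, -3.0], best == 0
```

The word whose embedding points most nearly in the direction of the final vector gets the highest score.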
trace 11 of 12 · then it goes back
In training, the model is told the true next word, and the error flows backward through everything you just scrolled. The first gradient is beautifully simple:
loss = −log p[target]
∂loss/∂logits = p − onehot(target)
Watch the panel run in reverse: the error enters at the output and streams right-to-left through every connection it came by. From here the chain rule walks back through W_O, A, Q·Kᵀ, the embeddings: every matrix you met, visited in reverse, each weight nudged a tiny step downhill.
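The first gradient is short enough to write out in full. The sketch below computes the loss and p − onehot(target) for a set of made-up logits.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(logits, target):
    """Cross-entropy loss and its gradient with respect to the logits."""
    p = softmax(logits)
    loss = -math.log(p[target])
    grad = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
    return loss, grad

logits = [2.0, 1.0, -3.0]   # made-up scores for a 3-word vocabulary
loss, grad = loss_and_grad(logits, target=1)
```

The gradient is negative only at the target index, so a step downhill raises the true word's logit and lowers all the others; its entries always sum to zero.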
trace 12 of 12 · and runs the process again
At inference there is no gradient. Instead, the sampled word is appended to the sequence and the whole forward pass runs again, one token longer:
tokens ← tokens + [sampled]
goto trace 0
That is the entire secret of generation: this page, on a loop.
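The loop itself is tiny once the forward pass is treated as a black box. In this sketch `toy_model` is a hypothetical stand-in for the whole forward pass (it always predicts "last id + 1"), used only so the loop can run on its own.

```python
import random

def generate(tokens, next_token_probs, n_new, rng):
    """Sample a token, append it, and rerun the model on the longer sequence."""
    tokens = list(tokens)
    for _ in range(n_new):
        p = next_token_probs(tokens)   # one full forward pass over the sequence so far
        sampled = rng.choices(range(len(p)), weights=p, k=1)[0]
        tokens.append(sampled)
    return tokens

def toy_model(tokens, vocab_size=8):
    """Hypothetical stand-in: puts all probability on (last id + 1) mod 8."""
    p = [0.0] * vocab_size
    p[(tokens[-1] + 1) % vocab_size] = 1.0
    return p

out = generate([0, 1, 2], toy_model, n_new=4, rng=random.Random(0))
# out == [0, 1, 2, 3, 4, 5, 6]
```

Swapping `toy_model` for a real forward pass changes nothing in `generate`: the loop only ever asks for a probability vector over the vocabulary.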