The Forward Pass

trace 0 of 12 · the sentence

❯Everything below is computed live on this page: a tiny but genuine transformer. Vocabulary of 8 words, model width d = 8, 2 heads, 2 layers, weights fixed by seed.

❯The input sequence:

❯The model's entire job: read these tokens and predict the next word. Scroll. The network beside you lights up, stage by stage, as the signal moves through it.

trace 1 of 12 · tokens become vectors

❯Each token looks up its row in the embedding table W_E (8 words × 8 dims):

X = W_E[tokens]
X : (T × 8)

❯Words are now points in 8-dimensional space. Nothing about order yet: "us between light the" would give the same rows in a different order.
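❯A minimal sketch of the lookup, in plain Python. The seed value, the Gaussian initialisation, and the token ids are assumptions made for illustration; only the 8 × 8 table shape comes from the text.

```python
import random

VOCAB_SIZE, D = 8, 8           # sizes stated in the text
rng = random.Random(0)         # assumption: the page's actual seed is unknown

# W_E holds one row of D numbers per vocabulary word.
W_E = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(VOCAB_SIZE)]

def embed(tokens):
    """X = W_E[tokens]: copy one table row per token id, giving a (T x D) matrix."""
    return [list(W_E[t]) for t in tokens]

tokens = [3, 1, 5, 7]          # hypothetical token ids, T = 4
X = embed(tokens)
X_reversed = embed(tokens[::-1])
```

❯Reversing the tokens only reverses the rows of X: the lookup itself carries no information about position.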

trace 2 of 12 · position is added

❯Attention is permutation-blind, so we stamp each row with where it sits, using sine and cosine waves of different frequencies:

P[t,2i] = sin(t / 10000^(2i/d))
P[t,2i+1] = cos(t / 10000^(2i/d))
X ← X + P

❯Watch the embedding nodes ring as the waves are stamped in: that ripple is word order.
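❯The stamping step can be sketched directly from the two formulas. The function names here are illustrative, not taken from the page.

```python
import math

def positional_encoding(T, d):
    """P[t][2i] = sin(t / 10000^(2i/d)), P[t][2i+1] = cos of the same angle."""
    P = [[0.0] * d for _ in range(T)]
    for t in range(T):
        for i in range(d // 2):
            angle = t / (10000 ** (2 * i / d))
            P[t][2 * i] = math.sin(angle)
            P[t][2 * i + 1] = math.cos(angle)
    return P

def add_positions(X):
    """X <- X + P, element by element."""
    P = positional_encoding(len(X), len(X[0]))
    return [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, P)]

# Two identical rows (the same word twice) become distinguishable once stamped.
X = [[1.0] * 8, [1.0] * 8]
Y = add_positions(X)
```

❯Position 0 always gets sin = 0 and cos = 1 in every pair of columns; later positions rotate away from that at a different speed per pair.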

trace 3 of 12 · three questions per word

❯Layer 1 projects every row of X three ways with learned matrices:

Q = X·W_Q    "what am I looking for?"
K = X·W_K    "what do I contain?"
V = X·W_V    "what do I pass on if chosen?"

❯With 2 heads, each head gets half the width: Q, K, V are (T × 4) per head. The panel shows head 1's Q.
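❯A sketch of the projection and the head split. It assumes one full-width (8 × 8) matrix per projection whose output columns are then sliced per head; the page may instead keep a separate (8 × 4) matrix per head, which computes the same thing. The random X and the seed are stand-ins.

```python
import random

D, N_HEADS = 8, 2              # sizes stated in the text
D_K = D // N_HEADS             # per-head width: 4
rng = random.Random(0)         # assumption: arbitrary seed

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_heads(M, n_heads):
    """Cut a (T x d) matrix into n_heads column slices of shape (T x d/n_heads)."""
    d_k = len(M[0]) // n_heads
    return [[row[h * d_k:(h + 1) * d_k] for row in M] for h in range(n_heads)]

X = rand_matrix(4, D)          # stand-in for the position-stamped input, T = 4
W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

Q = matmul(X, W_Q)
Q_heads = split_heads(Q, N_HEADS)
K_heads = split_heads(matmul(X, W_K), N_HEADS)
V_heads = split_heads(matmul(X, W_V), N_HEADS)
```

❯Q_heads[0] corresponds to the head 1 query matrix shown in the panel: T rows of 4 numbers each.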

trace 4 of 12 · every word scores every word

❯Each query is dotted against every key, and the result is divided by √d_k (here d_k = 4, the per-head width) so the softmax doesn't saturate:

S = Q·Kᵀ / √d_k
S : (T × T)

❯This is the famous quadratic cost: T words means T² scores. The panel now wires every token to every token: each edge is one score, thickness by strength.
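❯The score matrix for one head, sketched with hand-picked queries and keys (hypothetical values, chosen so the arithmetic is easy to follow):

```python
import math

def attention_scores(Q, K):
    """S = Q K^T / sqrt(d_k); S[i][j] is query i dotted with key j, then scaled."""
    scale = math.sqrt(len(Q[0]))
    return [[sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K]
            for q_row in Q]

# T = 3 tokens, d_k = 4 as in the text.
Q = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
K = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]

S = attention_scores(Q, K)     # 3 tokens give 3 x 3 = 9 scores
```

❯The nested loop over queries and keys is the quadratic cost made explicit: doubling T quadruples the number of scores.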

trace 5 of 12 · the future is forbidden

❯We are predicting the next word, so position i must not peek at positions j > i. Those scores are set to −∞, then each row is softmaxed:

S[i,j] = −∞ for j > i
A = softmax(S)    (rows sum to 1)

❯The edges pointing at the future are cut: the dashed ones on the right are the forbidden connections. What remains is the attention pattern: who listens to whom.
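❯Masking and the row-wise softmax can be sketched together. Subtracting the row maximum before exponentiating is a standard stability trick added here; it does not change the result. The input scores are hypothetical.

```python
import math

def causal_softmax(S):
    """Set S[i][j] to -inf for j > i, then softmax each row."""
    A = []
    for i, row in enumerate(S):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)                        # finite: position i can always see itself
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) is exactly 0.0
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

S = [[0.5, 0.0, 0.5],
     [0.0, 1.0, 1.0],
     [0.5, 0.5, 2.0]]
A = causal_softmax(S)
```

❯The first row can only attend to itself, so it is always [1, 0, 0]; every entry above the diagonal is exactly zero.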

trace 6 of 12 · the weighted mix

❯Each word's new representation is the attention-weighted blend of everyone's values, heads glued back together and mixed once more:

head_h = A_h·V_h
Z = concat(head_1, head_2)·W_O
Z : (T × 8)

❯This is the only place words exchange information. Everything else happens to each word alone.
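❯A sketch of the blend for T = 2 tokens. The attention weights and values are hypothetical, and W_O is set to the identity purely so the concatenated heads are visible in the output; a trained W_O would mix them.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mix_heads(A_heads, V_heads, W_O):
    """head_h = A_h V_h, then Z = concat(head_1, head_2) W_O."""
    heads = [matmul(A, V) for A, V in zip(A_heads, V_heads)]
    T = len(heads[0])
    concat = [[x for head in heads for x in head[t]] for t in range(T)]
    return matmul(concat, W_O)

A = [[1.0, 0.0],               # token 0 listens only to itself
     [0.5, 0.5]]               # token 1 splits attention evenly
V_1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
V_2 = [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0]]
W_O = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]

Z = mix_heads([A, A], [V_1, V_2], W_O)
```

❯Token 1's row of Z is the average of both tokens' values, head by head, because its attention weights are 0.5 and 0.5.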

trace 7 of 12 · residual + layernorm

❯The block's output is added to its input, then each row is normalized to steady mean and variance:

X ← LayerNorm(X + Z)

❯The addition is the skip connection: a gradient superhighway. Even a hundred layers deep, the original signal is one hop away.
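❯The add-then-normalize step, sketched under one assumption: the learned gain and bias that LayerNorm usually carries are left out, since the text does not mention them. The eps term only guards against division by zero.

```python
import math

def layer_norm(row, eps=1e-5):
    """Shift a row to mean 0 and scale it to (almost exactly) variance 1."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def add_and_norm(X, Z):
    """X <- LayerNorm(X + Z), applied to each row independently."""
    return [layer_norm([x + z for x, z in zip(x_row, z_row)])
            for x_row, z_row in zip(X, Z)]

X = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]   # hypothetical block input
Z = [[0.5] * 8]                                  # hypothetical attention output
out = add_and_norm(X, Z)

row_mean = sum(out[0]) / 8
row_var = sum((x - row_mean) ** 2 for x in out[0]) / 8
```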

trace 8 of 12 · the feed-forward expansion

❯Now each word, alone, is pushed through a two-layer MLP that expands to 16 dims, cuts the negatives, and comes back:

H = relu(X·W_1)
H : (T × 16)
X ← LayerNorm(X + H·W_2)

❯The panel fans each token through the hidden units: bright dots survived relu, dark ones were cut to zero. In a real model this is where most of the parameters live.
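❯A sketch of the feed-forward step with random stand-in weights. As assumptions, bias terms are omitted (the formula shows none) and LayerNorm has no learned gain or bias.

```python
import math
import random

D, D_FF = 8, 16                # sizes stated in the text
rng = random.Random(0)         # assumption: arbitrary seed

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def feed_forward(X, W_1, W_2):
    """H = relu(X W_1); X <- LayerNorm(X + H W_2). Returns (H, new X)."""
    H = [[max(0.0, h) for h in row] for row in matmul(X, W_1)]
    update = matmul(H, W_2)
    new_X = [layer_norm([x + u for x, u in zip(x_row, u_row)])
             for x_row, u_row in zip(X, update)]
    return H, new_X

W_1, W_2 = rand_matrix(D, D_FF), rand_matrix(D_FF, D)
X = rand_matrix(3, D)          # stand-in input, T = 3
H, X_out = feed_forward(X, W_1, W_2)
```

❯Running a single row through feed_forward gives the same result as that row's slot in the batch, which is what "each word, alone" means in practice.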

trace 9 of 12 · layer 2 runs the same dance

❯That whole ritual (attend, add, normalize, expand, add, normalize) is one layer. The output feeds straight into layer 2, which does it all again with its own weights.

for layer in 1..N: X = Block_layer(X)

❯Here N = 2. In a frontier model N can be 80 or more and d is in the thousands, but it is this exact loop. The panel lights the second block: same wiring, its own weights.
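❯Putting the earlier steps together, the loop over layers might look like the sketch below. The weights are random stand-ins with an arbitrary seed, biases and LayerNorm gains are omitted, and the dictionary layout of a layer's parameters is an illustrative choice.

```python
import math
import random

D, N_HEADS, D_FF, N_LAYERS = 8, 2, 16, 2   # sizes stated in the text
D_K = D // N_HEADS

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def add_and_norm(X, Z):
    return [layer_norm([x + z for x, z in zip(xr, zr)]) for xr, zr in zip(X, Z)]

def causal_softmax(S):
    A = []
    for i, row in enumerate(S):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

def attention(X, p):
    Q, K, V = matmul(X, p["W_Q"]), matmul(X, p["W_K"]), matmul(X, p["W_V"])
    heads = []
    for h in range(N_HEADS):
        cols = slice(h * D_K, (h + 1) * D_K)
        q, k, v = [r[cols] for r in Q], [r[cols] for r in K], [r[cols] for r in V]
        S = [[sum(a * b for a, b in zip(qr, kr)) / math.sqrt(D_K) for kr in k]
             for qr in q]
        heads.append(matmul(causal_softmax(S), v))
    concat = [[x for head in heads for x in head[t]] for t in range(len(X))]
    return matmul(concat, p["W_O"])

def block(X, p):
    """One layer: attend, add, normalize, expand, add, normalize."""
    X = add_and_norm(X, attention(X, p))
    H = [[max(0.0, h) for h in row] for row in matmul(X, p["W_1"])]
    return add_and_norm(X, matmul(H, p["W_2"]))

def init_layer(rng):
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W_Q": mat(D, D), "W_K": mat(D, D), "W_V": mat(D, D),
            "W_O": mat(D, D), "W_1": mat(D, D_FF), "W_2": mat(D_FF, D)}

def run_layers(X, layers):
    for p in layers:           # same wiring each time, different weights
        X = block(X, p)
    return X

rng = random.Random(0)
layers = [init_layer(rng) for _ in range(N_LAYERS)]
X = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(4)]   # T = 4
out = run_layers(X, layers)
```

❯Because of the causal mask, changing the last row of X leaves the earlier rows of the output untouched, however many layers are stacked.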

trace 10 of 12 · the prediction

❯Only the last word's vector is used to predict. It is compared against every word in the vocabulary (the embedding table, reused: weight tying):

logits = x_last·W_Eᵀ    (8 scores)
p = softmax(logits)

❯The brightest edge into the output column is the model's chosen next word.
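❯The readout with a tied embedding table, sketched with a deliberately simple hypothetical W_E (word k points along axis k) so that each logit is just one coordinate of x_last:

```python
import math

def predict_next(x_last, W_E):
    """logits = x_last . W_E^T (one score per vocabulary word); p = softmax(logits)."""
    logits = [sum(x * w for x, w in zip(x_last, row)) for row in W_E]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return logits, [e / total for e in exps]

# Hypothetical embedding table and final-position vector.
W_E = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
x_last = [0.1, 0.0, 0.3, 2.0, 0.0, 0.2, 0.0, 0.1]

logits, p = predict_next(x_last, W_E)
next_word = max(range(len(p)), key=p.__getitem__)   # index of the largest probability
```

❯Here the vector lies mostly along word 3's embedding, so word 3 gets the largest logit and the largest probability.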

trace 11 of 12 · then it goes back

❯In training, the model is told the true next word, and the error flows backward through everything you just scrolled. The first gradient is beautifully simple:

loss = −log p[target]
∂loss/∂logits = p − onehot(target)

❯Watch the panel run in reverse: the error enters at the output and streams right-to-left through every connection it came by. From here the chain rule walks back through W_O, A, Q·Kᵀ, the embeddings: every matrix you met, visited in reverse, each weight nudged a tiny step downhill.
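❯The formula p − onehot(target) can be checked numerically. The sketch below compares it with a central finite difference of the loss; the logits, the target, and the step size h are arbitrary choices for the check.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, target):
    """Cross-entropy: -log p[target]."""
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    """d loss / d logits = p - onehot(target)."""
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(logits, target, h=1e-5):
    """Central difference, one logit at a time."""
    grad = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += h
        down[i] -= h
        grad.append((loss(up, target) - loss(down, target)) / (2 * h))
    return grad

logits = [0.5, -1.0, 2.0, 0.0, 0.3, -0.7, 1.2, 0.1]   # hypothetical scores
target = 2
g_exact = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
max_gap = max(abs(a - b) for a, b in zip(g_exact, g_numeric))
```

❯The gradient is negative only at the target (push that logit up) and positive everywhere else (push the others down), and its entries sum to zero.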

trace 12 of 12 · and runs the process again

❯At inference there is no gradient. Instead, a word is sampled from p and appended to the sequence, and the whole forward pass runs again, one token longer:

tokens ← tokens + [sampled]
goto trace 0

❯That is the entire secret of generation: this page, on a loop.
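❯The loop itself is short. In this sketch the forward pass is passed in as a function; toy_probs is a hypothetical stand-in for the real model (it always predicts the next id in the vocabulary) so the example stays self-contained and deterministic.

```python
import random

def generate(tokens, next_token_probs, n_new, rng):
    """Append n_new sampled tokens, re-running the forward pass each time."""
    tokens = list(tokens)
    for _ in range(n_new):
        p = next_token_probs(tokens)                 # forward pass on the longer sequence
        sampled = rng.choices(range(len(p)), weights=p, k=1)[0]
        tokens.append(sampled)                       # tokens <- tokens + [sampled]
    return tokens

def toy_probs(tokens):
    """Hypothetical stand-in model: all probability on (last id + 1) mod 8."""
    p = [0.0] * 8
    p[(tokens[-1] + 1) % 8] = 1.0
    return p

out = generate([2], toy_probs, 3, random.Random(0))
```

❯With a real model in place of toy_probs, p would spread over several words and the same prompt could continue differently from one run to the next.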
