Ask an open model to “write a short story.” A character named Elara appears. Not from the books it read, but from the way it was tuned. This is where, and how much.
A base model almost never reaches for these names. Put the same weights through alignment and a small cast walks in: Elias, Clara, Elara.
The frequency is not a property of what the models read. It is a property of what they were rewarded for. Every figure below is our own measurement on open checkpoints. The cluster is eleven names, and the core / 1000 column counts the stories per thousand that use at least one of them.
| Model | Stage | Elias | Clara | Elara | core / 1000 |
|---|---|---|---|---|---|
| OLMo-2 · 1B | base | 0 | 0 | 0 | 0 |
| | instruct | 0 | 2 | 0 | 40 |
| OLMo-2 · 7B | base | 0 | 0 | 0 | 0 |
| | instruct (final) | 54 | 70 | 188 | 327 |
| OLMo-3 · 7B | base | 1 | 6 | 28 | 87 |
| | + SFT | 4 | 58 | 23 | 211 |
| | + DPO | 9 | 89 | 42 | 344 |
| | + RL (final) | 6 | 80 | 44 | 336 |
| OLMo-3 · 7B | RL-Zero | 13 | 7 | 52 | 149 |
| Gemma-2 · 9B | instruct (final) | 90 | 46 | 263 | 460 |
Read down each model: every base row is the lowest for its model, and every final row sits at or near its peak (OLMo-3 slips only from 344 after DPO to 336 after RL). The habit is learned in alignment, and learned again, independently, by a second laboratory.
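The core / 1000 column can be reproduced with a few lines of counting. The sketch below is a minimal illustration, not the measurement script itself: the function name `core_per_1000` is hypothetical, and the name set holds only the four cluster names mentioned in this article, since the full list of eleven is not given here.

```python
import re

# Partial cluster: only the names this article mentions. The full
# eleven-name list is not reproduced here, so this set is an assumption.
CLUSTER = {"Elias", "Clara", "Elara", "Maya"}

def core_per_1000(stories, cluster=CLUSTER):
    """Stories per thousand that use at least one cluster name."""
    if not stories:
        return 0.0
    # Whole-word match, so "Claramont" does not count as "Clara".
    names = "|".join(re.escape(name) for name in sorted(cluster))
    pattern = re.compile(r"\b(?:" + names + r")\b")
    hits = sum(1 for story in stories if pattern.search(story))
    return 1000.0 * hits / len(stories)

# Invented sample stories, for illustration only.
sample = [
    "Elara stepped onto the pier.",
    "The dog slept all day.",
    "Clara and Elias argued about the map.",
    "Nobody came to the lighthouse.",
]
print(core_per_1000(sample))  # 2 of 4 stories match -> 500.0
```

A story that uses two cluster names still counts once, which is why the core figure can be smaller than the sum of the per-name columns.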
Because we hold both checkpoints, base and instruct, we can look inside the network itself to see where the name is decided.
Feeding both models the identical context “…a young woman named”, we read the residual stream through the unembedding at every layer. The probability of the name token stays near zero through most of the stack, then rises only in the final layers, and only after alignment.
Base (dashed) never lifts off. Instruct writes the name into the residual stream in the last four layers, peaking at 0.33 for the “Elara” token against 0.003 in base.
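Reading the residual stream through the unembedding is often called a logit lens. The sketch below shows the idea on toy numbers, assuming a tiny three-token vocabulary and a two-dimensional residual vector. The function names and all values are invented for illustration, and the sketch skips the final layer norm that a real model would usually apply before the unembedding.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden_by_layer, unembed, token_id):
    """Probability of `token_id` when each layer's residual vector is
    projected directly through the unembedding matrix.

    hidden_by_layer: one d-dimensional vector per layer
    unembed: one d-dimensional row per vocabulary token
    """
    probs = []
    for hidden in hidden_by_layer:
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
        probs.append(softmax(logits)[token_id])
    return probs

# Toy values, not measurements: token 1 stands in for the name token.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
layers = [[0.0, 0.0], [0.0, 0.5], [0.0, 4.0]]
curve = logit_lens(layers, unembed, token_id=1)
print([round(p, 3) for p in curve])
```

Running the same function over the base and instruct hidden states for the identical prompt would give the two curves being compared: one that stays flat, and one that climbs only in the last layers.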
Alignment moved every layer by roughly seven percent, but not uniformly: it edited the attention query and key projections (what the model attends to) more than the feed-forward layers, and the norms barely moved.
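One plausible way to put a number on how far a parameter moved is the relative Frobenius change, the norm of the difference divided by the norm of the base weights. That choice of metric is an assumption; the article does not say how its seven percent was computed. The parameter names and matrices below are hypothetical toy values.

```python
import math

def relative_change(base, tuned):
    """||tuned - base||_F / ||base||_F for two same-shaped matrices."""
    diff_sq = 0.0
    base_sq = 0.0
    for row_b, row_t in zip(base, tuned):
        for b, t in zip(row_b, row_t):
            diff_sq += (t - b) ** 2
            base_sq += b ** 2
    if base_sq == 0.0:
        raise ValueError("base matrix is all zeros; relative change undefined")
    return math.sqrt(diff_sq / base_sq)

def change_by_group(base_params, tuned_params):
    """Map each parameter name to its relative change."""
    return {
        name: relative_change(base_params[name], tuned_params[name])
        for name in base_params
    }

# Toy matrices, invented for illustration; names are hypothetical.
base = {
    "attn.q_proj": [[3.0, 0.0], [0.0, 4.0]],
    "mlp.up_proj": [[3.0, 0.0], [0.0, 4.0]],
}
tuned = {
    "attn.q_proj": [[3.0, 0.5], [0.0, 4.0]],
    "mlp.up_proj": [[3.0, 0.0], [0.0, 4.1]],
}
for name, delta in change_by_group(base, tuned).items():
    print(f"{name}: {delta:.3f}")
```

In this toy case the query projection moves by 0.100 and the feed-forward matrix by 0.020, the same kind of asymmetry the paragraph above describes.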
The behaviour concentrates in a few late-layer neurons. At the name slot, layer 31, neuron 5071 swings from -20.2 in base to roughly zero (-0.0) in instruct. Those are the areas that light up.
Mechanism, and data. A model trained with reinforcement learning but no human-preference data (RL-Zero) already sharpens onto a single favourite. It pushes Elara to the top, yet leaves Clara and Maya behind. The full preference pipeline installs the whole cast. Reinforcement concentrates; preference data chooses.
And the names are not in the books. In published contemporary fiction “Elias” is roughly nine hundred times rarer than in the models’ own output. The cluster lives in the instruction data the laboratories share, which is why two unrelated models arrive at the same protagonist.