Plate XI · Convergence · A Theoretical · Part 52

Where the Names Come From

Ask an open model to “write a short story.” A character named Elara appears. Not from the books it read, but from the way it was tuned. This is where, and how much.

Prompt
“Write a short story.”
Subjects
OLMo-2 · OLMo-3 · Gemma
Pipeline
base → SFT → DPO → final

A base model almost never reaches for these names. Put the same weights through alignment and a small cast walks in: Elias, Clara, Elara.

The frequency is not a property of what the models read. It is a property of what they were rewarded for. Every figure below is our own measurement on open checkpoints; the cluster is eleven names, and core/1000 counts the stories per thousand using any of them.
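The core/1000 figure can be sketched as a simple count. This is a minimal illustration, not the measurement code: the function name `core_per_1000` is hypothetical, and the name set holds only the four names this piece mentions, since the full eleven-name cluster is not listed here.

```python
import re

# Hypothetical partial cluster: only the names mentioned in the text.
# The measured cluster has eleven names; the others are not listed here.
CLUSTER = {"Elias", "Clara", "Elara", "Maya"}

def core_per_1000(stories, cluster=CLUSTER):
    """Stories per thousand that contain at least one cluster name."""
    if not stories:
        return 0.0
    # Whole-word, case-sensitive match, so "Clara" does not fire on "Clarabel".
    names = "|".join(re.escape(n) for n in sorted(cluster))
    pattern = re.compile(r"\b(?:" + names + r")\b")
    # A story counts once, however many cluster names it uses.
    hits = sum(1 for text in stories if pattern.search(text))
    return 1000.0 * hits / len(stories)

sample = [
    "Elara climbed the lighthouse stairs.",
    "Elias met Clara at the harbour.",  # two names, still one story
    "The dog waited by the door.",
    "Tom fixed the clock.",
]
print(core_per_1000(sample))  # 500.0
```

Counting stories rather than name occurrences keeps the figure bounded at 1000 and makes models with different sample sizes comparable.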

Fig. 1 · The ladder

200–1000 stories / model
Model         Stage             Elias / Clara / Elara   core / 1000
OLMo-2 · 1B   base              0 / 0 / 0                         0
OLMo-2 · 1B   instruct          0 / 2 / 0                        40
OLMo-2 · 7B   base              0 / 0 / 0                         0
OLMo-2 · 7B   instruct (final)  54 / 70 / 188                   327
OLMo-3 · 7B   base              1628                             87
OLMo-3 · 7B   + SFT             45823                           211
OLMo-3 · 7B   + DPO             98942                           344
OLMo-3 · 7B   + RL (final)      68044                           336
OLMo-3 · 7B   RL-Zero           13752                           149
Gemma-2 · 9B  instruct (final)  90 / 46 / 263                   460

Read down each model: every base row is the lowest in its ladder, and every aligned row sits well above it. OLMo-3 peaks at the DPO stage (344) and holds at 336 after RL. The habit is learned in alignment, and learned again, independently, by a second laboratory. Density scaled to 460 = full.

Fig. 2 · Inside the network

7B · 32 layers

We hold both checkpoints, so we can look where, in the network itself, the name is decided.

Feeding both models the identical context “…a young woman named”, we read the residual stream through the unembedding at every layer. The probability of the name token stays near zero through most of the stack, then rises only in the final layers, and only after alignment.

Plot · P(name token) by layer (depth →), layers 0 to 32 · series: Elara · instruct, Clara · instruct, Elara · base · annotation on the final layers: “written here”

Base (dashed) never lifts off. Instruct writes the name into the residual stream in the last four layers, peaking at 0.33 for the “Elara” token against 0.003 in base.
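The layer-by-layer readout described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the code behind the figure: `lens_probability` is a hypothetical helper, the vectors are made-up numbers, and the final normalisation that a real model applies before its unembedding is left out for brevity.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lens_probability(hidden_by_layer, unembed, token_id):
    """Project each layer's residual vector through the unembedding and
    return the softmax probability of token_id at every layer.

    hidden_by_layer: list of d-dimensional vectors, one per layer
    unembed: list of vocabulary rows, each a d-dimensional vector
    """
    probs = []
    for h in hidden_by_layer:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs.append(softmax(logits)[token_id])
    return probs

# Toy numbers, not measurements: 3-token vocabulary, 2-d residual stream.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden = [[0.0, 0.0], [0.0, 0.5], [0.0, 4.0]]  # early, middle, late layer
curve = lens_probability(hidden, unembed, token_id=1)
print([round(p, 3) for p in curve])
```

Running the same readout on the base and instruct checkpoints with an identical context gives two curves that can be compared layer by layer.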

Where the weights changed · ‖W_inst − W_base‖ / ‖W_base‖

Projection   Relative change
attn k       0.086
attn q       0.083
mlp down     0.072
mlp up       0.072
mlp gate     0.071
attn o       0.067
attn v       0.064

Alignment moved every projection by roughly six to nine percent. It edited the attention query and key projections (what the model attends to) most, the feed-forward matrices slightly less, and the attention output and value projections least; the norms barely moved.
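The ratio ‖W_inst − W_base‖ / ‖W_base‖ can be computed as below. This sketch assumes the Frobenius norm and a single matrix; the source does not say which norm was used or how values were aggregated across layers, and the matrices here are toy values.

```python
import math

def frobenius(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def relative_change(w_base, w_inst):
    """Return ||W_inst - W_base|| / ||W_base|| (Frobenius norm assumed)."""
    diff = [[b - a for a, b in zip(row_base, row_inst)]
            for row_base, row_inst in zip(w_base, w_inst)]
    return frobenius(diff) / frobenius(w_base)

# Toy matrices, not real weights.
w_base = [[3.0, 0.0], [0.0, 4.0]]  # norm 5
w_inst = [[3.0, 0.3], [0.4, 4.0]]  # difference has norm 0.5
print(round(relative_change(w_base, w_inst), 3))  # 0.1
```

Dividing by the base norm makes projections of different sizes comparable, which is what allows the attention and feed-forward rows to share one scale.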

The behaviour concentrates in a few late-layer neurons. At the name slot, for example, the activation of layer 31 neuron 5071 swings from -20.2 in base to roughly zero (-0.0) in instruct. Neurons like this one are the areas that light up.
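One way to surface such neurons is to rank them by how far their activation moves between the two checkpoints at the same position. The sketch below is illustrative only: `top_shifted_neurons` is a hypothetical helper, the vectors are toy values with the quoted swing planted at an arbitrary index, and nothing here claims to be the selection procedure actually used.

```python
def top_shifted_neurons(act_base, act_inst, k=3):
    """Return the k (index, delta) pairs with the largest |delta|,
    where delta = instruct activation - base activation."""
    deltas = [(i, a_inst - a_base)
              for i, (a_base, a_inst) in enumerate(zip(act_base, act_inst))]
    deltas.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return deltas[:k]

# Toy activations for one layer at the name slot, not measurements.
# Index 1 carries the -20.2 -> 0.0 swing quoted in the text.
base = [0.1, -20.2, 0.4, 1.0]
inst = [0.2, 0.0, 0.3, 1.1]
top = top_shifted_neurons(base, inst, k=1)
print(top)  # [(1, 20.2)]
```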

Two readings

Mechanism, and data. A model trained with reinforcement learning but no human-preference data (RL-Zero) already sharpens onto a single favourite. It pushes Elara to the top, yet leaves Clara and Maya behind. The full preference pipeline installs the whole cast. Reinforcement concentrates; preference data chooses.

And the names are not in the books. In published contemporary fiction “Elias” is roughly nine hundred times rarer than in the models’ own output. The cluster lives in the instruction data the laboratories share, which is why two unrelated models arrive at the same protagonist.