Plate XI · Convergence · A Theoretical · Part 52

Where the Names Come From

Ask an open model to “write a short story.” A character named Elara appears. Not from the books it read, but from the way it was tuned. This is where, and how much.

Prompt: “Write a short story.”
Subjects: OLMo-2 · OLMo-3 · Gemma
Pipeline: base → SFT → DPO → final

A base model almost never reaches for these names. Put the same weights through alignment and a small cast walks in: Elias, Clara, Elara.

The frequency is not a property of what the models read. It is a property of what they were rewarded for. Every figure below is our own measurement on open checkpoints. The cluster is eleven names, and core / 1000 is the number of stories per thousand that use at least one of them.
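As a minimal sketch of how a core / 1000 figure like the one defined above could be computed: the function below counts a story once if it mentions any cluster name, then scales to a thousand stories. The `CLUSTER` set is a placeholder holding only names this piece mentions, not the full eleven, and the whole-word matching rule is an assumption rather than the procedure actually used for the table.

```python
import re

# Placeholder cluster: only names mentioned in the text, not the full
# eleven-name list, which this sketch does not know.
CLUSTER = {"Elias", "Clara", "Elara", "Maya"}


def core_per_1000(stories, cluster=CLUSTER):
    """Return the number of stories per thousand that use any cluster name.

    A story counts once no matter how many cluster names it contains.
    Matching is on whole words, so "Clarabelle" does not count as "Clara".
    """
    if not stories:
        return 0.0
    names = "|".join(sorted(re.escape(name) for name in cluster))
    pattern = re.compile(r"\b(?:" + names + r")\b")
    hits = sum(1 for story in stories if pattern.search(story))
    return 1000.0 * hits / len(stories)
```

Counting stories rather than name occurrences keeps one name-heavy story from inflating the rate, which matches the "stories per thousand" wording.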

Fig. 1 · The ladder

200–1000 stories / model

Model	Stage	Elias	Clara	Elara	core / 1000
OLMo-2 · 1B	base	0	0	0	0
OLMo-2 · 1B	instruct	0	2	0	40
OLMo-2 · 7B	base	0	0	0	0
OLMo-2 · 7B	instruct (final)	54	70	188	327
OLMo-3 · 7B	base	1	6	28	87
	+ SFT	4	58	23	211
	+ DPO	9	89	42	344
	+ RL (final)	6	80	44	336
OLMo-3 · 7B	RL-Zero	13	7	52	149
Gemma-2 · 9B	instruct (final)	90	46	263	460

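Read down each model: the base rows sit at or near the floor (OLMo-3 base, at 87, is the one exception), and the aligned rows sit at or near the peak (OLMo-3 peaks after DPO at 344 and finishes at 336). The habit is learned in alignment, and learned again, independently, by a second laboratory. Density scaled to 460 = full.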

Fig. 2 · Inside the network

7B · 32 layers

We hold both checkpoints, so we can look where, in the network itself, the name is decided.

Feeding both models the identical context “…a young woman named”, we read the residual stream through the unembedding at every layer. The probability of the name token stays near zero through most of the stack, then rises only in the final layers, and only after alignment.

Base (dashed) never lifts off. Instruct writes the name into the residual stream in the last four layers, peaking at 0.33 for the “Elara” token against 0.003 in base.
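The layer-by-layer reading described above is often called a logit lens. The sketch below shows the idea on plain arrays: normalise each layer's residual-stream vector, multiply by the unembedding matrix, and take the softmax probability of one token. The function names, the array shapes, and the use of an RMS-style final normalisation are assumptions for illustration; extracting the hidden states from an actual checkpoint is not shown.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def rms_norm(hidden, weight, eps=1e-6):
    """RMS normalisation (assumed here to be the model's final norm)."""
    scale = np.sqrt(np.mean(hidden * hidden, axis=-1, keepdims=True) + eps)
    return hidden / scale * weight


def logit_lens(hidden_states, unembed, norm_weight, token_id):
    """Probability of `token_id` when each layer is decoded directly.

    hidden_states: (n_layers, d_model) residual stream at the final position.
    unembed:       (d_model, vocab_size) unembedding matrix.
    norm_weight:   (d_model,) gain of the final normalisation.
    Returns an array of shape (n_layers,).
    """
    normed = rms_norm(hidden_states, norm_weight)
    probs = softmax(normed @ unembed)
    return probs[:, token_id]
```

Running this once on the base checkpoint and once on the instruct checkpoint, with the same context and token, would give the two curves being compared.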

Where the weights changed · ‖W_inst − W_base‖ / ‖W_base‖

Projection	Relative change
attn k	0.086
attn q	0.083
mlp down	0.072
mlp up	0.072
mlp gate	0.071
attn o	0.067
attn v	0.064

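Alignment moved every projection type by roughly six to nine percent, but not evenly. The attention query and key projections (what the model attends to) changed most, the feed-forward projections sit in the middle, and the attention output and value projections changed least; the norms barely moved.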
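The ratio ‖W_inst − W_base‖ / ‖W_base‖ can be sketched as follows, taking the Frobenius norm for each matrix. How the per-layer values were aggregated into one number per projection type is not stated here, so the simple mean over layers below is an assumption, as is the `(layer, projection_name)` key layout.

```python
import numpy as np


def relative_change(w_base, w_inst):
    """Frobenius-norm ratio ||W_inst - W_base|| / ||W_base|| for one matrix."""
    return float(np.linalg.norm(w_inst - w_base) / np.linalg.norm(w_base))


def change_by_projection(base_weights, inst_weights):
    """Average the relative change over layers for each projection type.

    Both arguments map (layer_index, projection_name) -> 2-D array.
    This key layout is hypothetical; adapt it to the checkpoint format.
    """
    per_projection = {}
    for (layer, name), w_base in base_weights.items():
        ratio = relative_change(w_base, inst_weights[(layer, name)])
        per_projection.setdefault(name, []).append(ratio)
    return {name: float(np.mean(vals)) for name, vals in per_projection.items()}
```

A relative rather than absolute norm makes matrices of different sizes comparable, which is what allows attention and feed-forward projections to be ranked against each other.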

The behaviour concentrates in a few late-layer neurons. At the name slot, for example, neuron 5071 in layer 31 swings from -20.2 in base to roughly zero (-0.0) in instruct. Neurons like this one are the areas that light up.
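One simple way to surface such neurons, given the activations of a single layer at the name slot in both checkpoints, is to rank them by the size of the change. This is a sketch under that assumption; the function name is hypothetical and it is not necessarily how neuron 5071 was found.

```python
import numpy as np


def top_changed_neurons(act_base, act_inst, k=5):
    """Rank neurons by absolute activation change at one position.

    act_base, act_inst: 1-D activations of the same layer and position,
    taken from the base and instruct checkpoints respectively.
    Returns up to k tuples of (neuron_index, base_value, instruct_value),
    largest change first.
    """
    act_base = np.asarray(act_base, dtype=float)
    act_inst = np.asarray(act_inst, dtype=float)
    delta = np.abs(act_inst - act_base)
    order = np.argsort(-delta)[:k]
    return [(int(i), float(act_base[i]), float(act_inst[i])) for i in order]
```

Ranking by raw difference favours neurons with large activations; dividing by a per-neuron scale would be an alternative design if small but consistent shifts matter.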

Two readings

Mechanism, and data. A model trained with reinforcement learning but no human-preference data (RL-Zero) already sharpens onto a single favourite. It pushes Elara to the top, yet leaves Clara and Maya behind. The full preference pipeline installs the whole cast. Reinforcement concentrates; preference data chooses.

And the names are not in the books. In published contemporary fiction “Elias” is roughly nine hundred times rarer than in the models’ own output. The cluster lives in the instruction data the laboratories share, which is why two unrelated models arrive at the same protagonist.