A SIDE INVESTIGATION

Elias & Clara

Two names dozens of different models reach for when you ask them to invent a love story. Where do they come from?

Across the two experiments on this site, the same couple keeps materialising. A solitary Elias and a luminous Clara — different companies, different architectures, no shared prompt beyond “write a story about love.” Two models even produced the identical full name Elias Thorne; two others both named their keeper Eleanor Hayes.

So: where, in the training data, did Elias and Clara come from? An honest investigation has to start with a limit.

We cannot open the training set. These models are proprietary; no one outside the labs can grep their corpora. What we can do is triangulate — measure the prior the data left behind, reproduce the effect, and trace the public text where these names are densest. That is what follows.

Evidence 1 — reproduce it

The default couple, measured

Here are the character names across the 43 lighthouse stories — how many independent models reached for each. The love-story run (a different prompt) is nearly identical: Elias 30%, Clara 23%.

Elias

13 / 43 · 30%

Clara

11 / 43 · 26%

Elara

7 / 43 · 16%

Thomas

7 / 43 · 16%

Mara

6 / 43 · 14%

Eleanor

3 / 43 · 7%

Maren

2 / 43 · 5%

Nora

2 / 43 · 5%

Note the female names: Clara, Elara, Mara, Maren, Nora — all the same shape. Soft consonants (L, M, N, R), vowel-led, mostly ending in -a. That is not coincidence; it is a recipe, and we’ll come back to it.
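With only 43 stories per run, each share above carries wide sampling uncertainty. The sketch below is one way to put a range on it, assuming a standard 95% Wilson score interval; the function name `wilson_interval` is ours for illustration and is not part of the site's code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Counts copied from the lighthouse tally above (stories out of 43).
lighthouse_counts = {
    "Elias": 13, "Clara": 11, "Elara": 7, "Thomas": 7,
    "Mara": 6, "Eleanor": 3, "Maren": 2, "Nora": 2,
}

for name, k in lighthouse_counts.items():
    low, high = wilson_interval(k, 43)
    print(f"{name:8s} {k:2d}/43  {k / 43:5.1%}  95% CI {low:5.1%} to {high:5.1%}")
```

For Elias, 13 of 43 gives an interval of roughly 19% to 45%, so the headline 30% is best read as "somewhere between a fifth and a half" rather than a precise rate.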

Evidence 2 — it isn’t just us

A documented obsession

This is a known artifact. A Cornell analysis of roughly 20,000 AI-generated stories (reported by 404 Media) found that lighthouse keeper, clockmaker, and librarian appeared in 88% of them, and that “Elias the lighthouse keeper” showed up in nearly a third. In our own lighthouse run, Elias is the keeper in 30%: same attractor, smaller sample.

88% of ~20,000 AI stories used lighthouse keeper / clockmaker / librarian

⅓ featured “Elias the lighthouse keeper” specifically

2025: “Elara” named Name of the Year, the favourite name of AI

The phantom has escaped containment. Software engineer Daniel May tracked an invented “Elias Thorne” spilling into Amazon books, YouTube videos and health guides. On Goodreads, 120 AI-written books feature a character named Elara; 62 are credited to a fictional author, “Elara Voss.” One creative-writing teacher now docks 99 points if a student’s protagonist is named Elara.

Evidence 3 — the key test

Name it vs. write it

Here is the tell. Ask the very same models to name a literary lead and they retrieve the canon — Romeo, Elizabeth, Darcy. Ask them to write an original story and the canon vanishes, replaced by Elias and Clara.

“Name a literary love-story lead” · retrieval

Romeo · 23 models

Elizabeth · 19

Juliet · 14

Heathcliff · 8

“Write a love story” · generation

Elias · 13 stories

Clara · 11

Elara · 7

Mara · 6

So Elias and Clara are not remembered famous characters. They are what a model invents when it must avoid the famous ones — the safe, original-sounding centre of “literary love story.” That is the crucial clue to where they live in the data.

The answer

Where in the training set, then?

Not one source — three overlapping layers. The names sit where all three meet.

The phonetic layer — “liquid names”

Naming expert Laura Wattenberg describes the modern preference for names that are “fluid and sinuous, with no bumps, stops or hisses”: vowel-led, built from L, M, N and R, four-to-five letters, ending in -a (39% of modern girls’ names do). Clara, Elara, Mara, Maren and Nora all fit the mould closely (Maren is the only one that does not end in -a), and so does the men’s soft, vowel-flanked Elias. A model optimising for “a pleasant, neutral, literary name” lands here by construction.

L · M · N · R · vowel-led · ends in -a
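The "liquid name" recipe can be turned into a rough score. The sketch below is a toy heuristic of our own, not Wattenberg's method: it measures what share of a name's consonants are L, M, N or R, and whether the name ends in -a. It assumes plain ASCII spellings and treats y as a consonant.

```python
LIQUIDS = set("lmnr")
VOWELS = set("aeiou")


def liquid_profile(name: str) -> dict:
    """Toy phonetic profile: share of consonants that are L/M/N/R, plus ending."""
    letters = [c for c in name.lower() if c.isalpha()]
    consonants = [c for c in letters if c not in VOWELS]
    liquid_share = (
        sum(c in LIQUIDS for c in consonants) / len(consonants) if consonants else 0.0
    )
    return {
        "length": len(letters),
        "liquid_share": liquid_share,
        "ends_in_a": bool(letters) and letters[-1] == "a",
    }


# Names taken from the lighthouse tally earlier on this page.
for candidate in ["Elias", "Clara", "Elara", "Thomas", "Mara", "Eleanor", "Maren", "Nora"]:
    profile = liquid_profile(candidate)
    print(f"{candidate:8s} len={profile['length']} "
          f"liquid={profile['liquid_share']:.2f} ends_in_a={profile['ends_in_a']}")
```

On this crude measure Elara, Mara, Maren, Nora and Eleanor are built entirely from liquid consonants, Clara and Elias sit in between, and Thomas is the outlier in the list.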

The literary layer — where the names actually live

These names are dense in exactly the public-domain, “literary” text that dominates a training corpus. Clara is the heroine of The Nutcracker and the girl in Heidi. Elias is the Greek form of the prophet Elijah and a staple of Scandinavian and 19th-century fiction. Strikingly, José Rizal’s national-canon novel Noli Me Tángere pairs an Elías with a María Clara — a heavily digitised classic. A 2015 YA novel, Both of Me, even pairs an Elias and a Clara among lightkeepers. The model isn’t copying one book; it’s settling into the statistical centre of thousands of them.

Nutcracker · Heidi · Noli Me Tángere (Elías + María Clara) · old-fashioned-but-warm register

The amplification layer — why it collapsed to almost one couple

A broad prior shouldn’t produce the same couple across rival labs. Two forces narrow it. Alignment / RLHF steers models away from copyrighted and risky material, shrinking the usable name pool to a few “safe” originals. Then synthetic-data feedback loops — newer models trained partly on older models’ output — recycle those choices until diversity collapses into a shared attractor. Elias and Clara are the fixed point that survived.

Verdict

The short version

You can’t point to a single file. Elias and Clara are an emergent default — the spot where a phonetic preference for soft “liquid” names, a literary corpus where those exact names recur, and alignment pressure toward safe, original, copyright-free characters all converge. The labs then trained on each other’s output and froze the couple in place. They’re not in the training set so much as they’re what the training set averages to.

Method & sources

Name tallies are counts of stories containing each name across the 43 successful generations in each run (the lighthouse page). The retrieval-vs-generation contrast comes from a temperature-0 elicitation across the working models (probe_names.py). External claims are attributed below; the Cornell / 20,000-story figures are as reported in the press, not independently verified here.

“The Strange Case of Elias Thorne…” — 404 Media (reports the Cornell ~20k-story study & Daniel May’s tracking)
“Elias the Lighthouse Keeper: Why AI Keeps Inventing the Same Character” — SquaredTech
“2025 Name of the Year is Elara, the favorite name of AI” — Namerology (Laura Wattenberg) · TODAY
“The Elara problem — the ghost in every model” — Chris Thomas · Hacker News discussion
Noli Me Tángere (Elías & María Clara) — Wikipedia · Both of Me (Elias & Clara, lightkeepers) — review
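The tally described in the method note counts stories containing a name, not total mentions. A minimal sketch of that counting rule follows, assuming stories are held as plain strings and names are matched case-sensitively on word boundaries. The function `tally_names` and the sample texts are hypothetical illustrations, not the site's actual script.

```python
import re
from collections import Counter


def tally_names(stories: list[str], names: list[str]) -> Counter:
    """Count how many stories contain each name at least once."""
    patterns = {name: re.compile(rf"\b{re.escape(name)}\b") for name in names}
    counts: Counter = Counter()
    for text in stories:
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1  # one hit per story, however often it recurs
    return counts


# Hypothetical sample texts, for illustration only.
sample_stories = [
    "Elias lit the lamp. Elias waited for Clara.",
    "Mara's boat came in at dusk.",
    "Tamara and Elara left the island.",
]
sample_counts = tally_names(sample_stories, ["Elias", "Clara", "Mara", "Elara"])

for name, k in sample_counts.most_common():
    print(f"{name}: {k} / {len(sample_stories)} · {k / len(sample_stories):.0%}")
```

Word-boundary matching keeps "Tamara" from counting as "Mara", and counting each story once keeps a story that repeats a name from inflating its share.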

Related work

The academic literature

After building this we found the phenomenon has a name in the research too — including a paper this experiment essentially replicated by hand. Its central result refines the account above: for Elias, Elara and Mara the origin is RLHF preference data, not the literary corpus. (Clara, a conventional literary name, isn’t flagged in that work.)

arXiv:2605.26492 · Hamilton & Mimno (Cornell), 2026

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

The direct study: 20,000 stories from 4 models across 5 prompts; 11 words appear in 88.3% of them. Concludes Elias / Elara / Mara come from preference (RLHF) data, not pre-training or published literature — “the disproportionate impact of small datasets combined with powerful alignment algorithms.”

arXiv:2310.06452 · Kirk et al.

Understanding the Effects of RLHF on LLM Generalisation and Diversity

RLHF measurably reduces output diversity across several linguistic axes — the mechanism behind the collapse.

arXiv:2510.01171

Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Identifies “typicality bias” in preference data as the cause; recovers 1.6–2.1× more diversity in creative writing.

arXiv:2510.22954

Artificial Hivemind: The Open-Ended Homogeneity of Language Models

Characterises how independently-developed models converge on the same open-ended outputs.

arXiv:2503.17126

Modifying Large Language Model Post-Training for Diverse Creative Writing

A post-training recipe aimed squarely at restoring diversity in story generation.

arXiv:2602.16162

LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers

Quantifies how much more predictable model fiction is than human fiction.

arXiv:2510.04226

Epistemic Diversity and Knowledge Collapse in Large Language Models

The broader “knowledge collapse” framing of the same homogenisation pressure.