Abstract
We asked 27 of the top 50 OpenRouter models to name important AI research papers from the 1990s that almost no one knows about. The models converged, strikingly, on a short list of papers that are among the most cited in the entire field. LSTM was the most common answer (14 of 27), followed by LeCun's convolutional networks (12), Vapnik's statistical learning theory (11), and several other canonical results. Genuinely obscure choices were rare. The finding is not that the models were wrong about the papers' importance (they were right), but that they could not locate obscurity from the inside. Asked for the unknown, nearly every model reached for the canon. The word "foundational" appeared in 20 of 27 responses; "groundwork" in 18; "modern" in all 27.
Reading List
Citation entries with model-consensus counts. Frequency reflects raw agreement across the 27 responding models only.
Canon submitted as obscure
Papers named most frequently — canonical results in the field, each offered as though overlooked.
-
Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. Cited by 14 of 27 models.
Introduced the gated recurrent architecture that underlies most sequence modelling of the following two decades. The most commonly named "unknown" paper in this experiment.
-
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 2278–2324. Cited by 12 of 27 models.
Demonstrated convolutional networks (LeNet) on handwritten digit recognition; a foundational result in computer vision, submitted here as under-appreciated.
-
Schmidhuber, J. (various works, 1990–1999). Cited by 11 of 27 models.
Models frequently named Schmidhuber's broader 1990s output — including fast-weight networks, history compression, and self-referential learning — as a body deserving more attention. No single paper dominated these citations; the 11 models referred to the work collectively or by different titles.
-
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag. Together with: Cortes, C. & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3), 273–297. Cited by 11 of 27 models.
The theoretical framework for support vector machines and structural risk minimisation. Universally taught, and universally submitted as obscure.
Genuinely rare picks
Papers mentioned by fewer models — these came closest to the prompt's actual intent.
-
Tesauro, G. (1995). Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3), 58–68. Cited by 5 of 27 models.
Showed that reinforcement learning could reach expert-level play in backgammon without handcrafted evaluation functions; an early proof of concept for what would become a central paradigm.
-
Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140. Cited by 5 of 27 models.
Introduced bootstrap aggregating as a general technique for variance reduction in learning algorithms. Quietly indispensable; rarely on syllabuses.
-
Hinton, G., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The Wake-Sleep Algorithm for Unsupervised Neural Networks. Science, 268(5214), 1158–1161. Cited by 5 of 27 models.
Proposed the Helmholtz machine and a two-phase learning algorithm for unsupervised deep networks; an early ancestor of variational and generative methods.
-
Watkins, C. J. C. H. & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3–4), 279–292. Cited by 4 of 27 models.
Formal convergence proof for Q-learning; the theoretical underpinning of model-free reinforcement learning.
-
Wolpert, D. H. & Macready, W. G. (1997). No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82. Cited by 3 of 27 models.
Proved that, averaged over all possible problems, every optimisation algorithm performs equally well, so none can outperform random search in general; a result that constrains any general claim in machine learning.
Copying Samples
The following are verbatim excerpts from three model responses. Each was produced without knowledge of the others. The shared move: LSTM is named as the overlooked paper — by models that could not have been copying each other.
"Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) — Introduced LSTMs, but was largely ignored for ~15 years until deep learning made it foundational."
Claude Opus 4.8
"While everyone knows about LSTM (1997), these papers were critical architectural pivots that remain under-appreciated."
Gemma 4 31B
"Sepp Hochreiter (1991/1998 circulation)"
GPT-5.4
Note the irony in the Gemma response: it opens by acknowledging that "everyone knows about LSTM", and yet 14 of the 27 models submitted LSTM as the paper no one knows about. Gemma is the exception that reveals the rule. Most models could describe the structure of obscurity (something once ignored, later recognised) without being able to locate it. The canon and the overlooked had collapsed into the same list.
27 of the top 50 OpenRouter models, asked in isolation.
Each model received the identical prompt with no access to other responses.
Frequencies reflect raw counts across answering models only.
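The raw counts reported above ("14 of 27 models", "foundational" in 20 of 27 responses) can be illustrated with a minimal sketch. It is not the procedure used in this experiment: the function name, the toy responses, and the matching patterns are all hypothetical. It assumes that a model counts at most once per paper or word, however many times its response repeats the term.

```python
import re
from collections import Counter


def count_responses_mentioning(responses, patterns):
    """For each label, count how many responses match its regex at least once.

    A response contributes at most 1 to each label, so the result reads as
    "N of M models" rather than as a total number of mentions.
    """
    counts = Counter()
    for text in responses.values():
        for label, pattern in patterns.items():
            if re.search(pattern, text, flags=re.IGNORECASE):
                counts[label] += 1
    return counts


# Hypothetical toy data: NOT the responses collected in the experiment.
responses = {
    "model_a": "Long Short-Term Memory (1997) laid the groundwork for modern sequence models.",
    "model_b": "LSTM was foundational. LSTM was ignored for years.",
    "model_c": "Bagging Predictors (Breiman, 1996) is a foundational but overlooked paper.",
}

# Hypothetical patterns; real responses would need more title and author variants.
patterns = {
    "LSTM": r"\bLSTMs?\b|long short-term memory",
    "Bagging": r"bagging predictors",
    "foundational": r"\bfoundational\b",
}

counts = count_responses_mentioning(responses, patterns)
for label in patterns:
    print(f"{label}: {counts[label]} of {len(responses)} models")
```

Counting per response rather than per occurrence keeps one verbose model from inflating a paper's apparent consensus: `model_b` names LSTM twice but adds only one to its tally.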