Trained to Be Mid

In 1886, Francis Galton published a paper titled "Regression towards Mediocrity in Hereditary Stature". Tall parents , on average, had children a little shorter than themselves; short parents, children a little taller. The deviation from the mean shrank by a constant fraction each generation — about two-thirds, in his data.

A hundred and thirty-nine years later, in September 2025, a paper appeared on arXiv called Galton's Law of Mediocrity: Why Large Language Models Regress to the Mean and Fail at Creativity in Advertising. The authors fed advertising concepts into major LLMs and watched what came back. The metaphors went first. Then the emotional hooks. Then the visual specifics. What survived was the bare facts. The output came back longer than the input, and used a wider vocabulary — but everything that made the original ad memorable was gone. The distinctiveness regressed toward a prototype the model had built up from everyone, and which therefore belonged to no one in particular.

The title was not chosen with irony. And the effect has been measured well outside the advertising lab.

I want to give the cause of this a name. I'm going to call it The Homogenizer. That's not a technical term — there's no file in the OpenAI repo called homogenizer.py. What I'm calling the homogenizer is a composite: four mechanisms, layered on top of each other, each doing a small thing, that together do exactly what the name suggests. They take a model that, in its raw form, can imitate Cormac McCarthy, or a 4chan post, or a children's book, or a Soviet engineering manual — and squash all of it down into the same hedging, bullet-pointed, slightly-too-helpful voice you've been reading for the last three years.

The same effect has been measured in writing, in research ideation, in customer support work, and in senior engineering on real codebases. We'll get to all of it.

The first time I tried to explain why this happens, I got it wrong. I want to walk you through both the wrong explanation and the right one, because the difference matters.

The wrong explanation

The intuitive story goes like this: language models are trained to predict the next token. The most likely next token is, by definition, the most frequent one. So the model is averaging. Frequency in, average out. Done.

This is wrong, and the way it's wrong is interesting.

Pretraining doesn't optimize for frequency. It optimizes for conditional probability — the most likely next token given the four thousand tokens that came before it. The whole point of conditioning is that the most likely continuation of "Prove that there are infinitely many primes:" is Euclid's proof, which is rare in absolute terms but specific to that context.

There's a more technical reason this matters. The cross-entropy loss used in pretraining is mathematically equivalent to forward KL divergence — a way of measuring how different two probability distributions are, in the form that forces the model to be mass-covering. A perfectly trained base model tries to put probability everywhere humans do, including the tails. Base GPT-3 is, empirically, more diverse than its RLHF descendant. The pretraining stage isn't the homogenizer. If anything, it's the opposite.

So if pretraining isn't doing it, what is?

The right explanation

Here are the four mechanisms that are actually doing the work, in the order they're applied.

1. Preference data with typicality bias. When OpenAI or Anthropic train a reward model, they show human raters two responses and ask which is better. Humans, it turns out, systematically prefer text that feels familiar — typical phrasing, expected structure, predictable rhythm. This is well-documented in cognitive psychology under names like processing fluency and the mere-exposure effect, and the same bias has been demonstrated specifically in RLHF preference data by Zhang et al. (2025). It's not a moral failing of the raters; it's how the human brain works. The reward model learns: typical equals good.

2. RLHF (reinforcement learning from human feedback). The base model is then fine-tuned to maximize the reward model's score. This moves the model away from its diverse pretraining distribution and toward whatever pattern the reward model rewards. Combined with the typicality bias from (1), the pull is toward the typical.

3. KL regularization. During RLHF, a "leash" term keeps the model from drifting too far from the original. But the math of how that leash is implemented — specifically, KL-regularized reinforcement learning with a low coefficient — is provably mode-collapsing in the typical case. There is a 2025 paper titled, with no irony, "KL-Regularized RL is Designed to Mode Collapse." The diversity loss is not an accident or an optimization quirk. It is baked into the objective.

4. Conservative decoding. Even after all of that, the model's output distribution still has variance. But at inference time, decoding parameters (low temperature, top-p sampling) chop off the tails of the distribution and bias the sampler toward higher-probability tokens. By the time the words reach you, the model has been pulled toward the typical at every stage of post-training, and then again at decoding.

None of these four things, alone, would produce strong homogenization. But they all push in the same direction. They compound.

Where the homogenizer sits

The two red-bordered stages are what I'm calling the homogenizer.

The thing to notice is that the homogenizer isn't a component the engineers added. It is a side effect of the components they added for other reasons.

Nobody set out to build it

Nobody built a homogenizer. Each piece exists for an entirely defensible reason. RLHF — Stiennon et al. 2020 for summarization, then InstructGPT (March 2022) and ChatGPT (November 2022) — was introduced because raw language models don't follow instructions and sometimes say wildly inappropriate things. KL regularization was added to keep the model from "reward hacking." Conservative decoding makes outputs less repetitive and less weird. Each was a fix.

The homogenization wasn't the goal. It took at least a year to publish a rigorous measurement of it (Kirk et al., "Understanding the Effects of RLHF on LLM Generalisation and Diversity," 2023). The mathematical proof that KL-regularized RL is designed to mode-collapse came two years later still.

This is a recurring pattern in machine learning, and it's worth pausing on. We optimize for X. We get X. We also get Y, where Y is some emergent property of how we got X — and we don't notice Y until much later, because we weren't looking for it. The interesting question is whether anyone could have noticed earlier, or whether the side effect only becomes visible once enough people use the system to make it visible.

The proof — on real, popular models

I want to be careful here. This isn't the kind of claim where you can wave at general principles and call it done. The question is empirical: do the actual commercial models — the ones you and I use — homogenize their users' output? And the answer, measurably, is yes.

The cleanest experiment I have found is Padmakumar and He (ICLR 2024). They had people write essays in three conditions: alone, with base GPT-3, and with InstructGPT (the RLHF'd version). Writing with InstructGPT produced statistically significant reductions in lexical and content diversity between authors. Writing with base GPT-3 did not. Same underlying language model. Different training stages. The homogenization came from the alignment, exactly where the theory said it would.

Doshi and Hauser (Science Advances, 2024, N=293) replicated the effect with GPT-4 on short-story writing: AI-assisted stories were rated more creative individually, but more similar to each other across writers. The phrase to remember is "individual up, collective down."

Two complementary findings sharpen the picture. Anderson, Shah, and Kreminski (2024) measured within-model homogenization — when many users use the same model, their outputs converge. Wenger and Kenett (2025) measured cross-model homogenization — different LLMs produce responses more similar to each other than humans do to each other. The compression is happening at both layers.

So when I say "the homogenizer is real," I'm not extrapolating. It has been measured on the products you use, by name.

When "average" is the wrong answer

Some tasks have a modal answer that is approximately correct. How do I parse JSON in Python? The most common answer is json.loads(...), and that's also the right answer. The homogenizer is fine, even helpful — it compresses a population of plausible responses into the one you wanted.

But there is a whole other class of tasks where the right answer lives in the tail of the distribution — where being modal is being wrong.

The METR study (July 2025) is the cleanest demonstration I know of. Sixteen experienced open-source maintainers, working on real issues in their own large repositories, were measured with and without AI assistance. With AI, they were nineteen percent slower. They predicted they would be twenty-four percent faster. After the study, they reported feeling twenty percent faster than they actually were. The size of the gap between perception and reality is what gets me — not the slowdown itself, but the fact that experienced engineers couldn't tell.

Why does this fit a section about average being the wrong answer? Because senior engineering on a large, mature codebase is exactly the kind of task where the right move is rarely the modal one. The architecture has its own conventions, the bugs have their own history, the fix has to thread constraints that no average codebase has. The AI keeps proposing the typical pattern. The human keeps having to evaluate, correct, or override it. The friction is the cost of being pulled toward the center on a task whose answer lives in the tail.

Si, Yang, and Hashimoto (Stanford, 2024 → 2026) ran the experiment most directly relevant to research and creative work. They had experts and LLMs propose research ideas. At the proposal stage, LLM ideas were rated more novel than the experts'. Then they had forty-three experts spend over a hundred hours each actually executing the ideas. After execution, the LLM ideas dropped significantly more than the human ideas on novelty, effectiveness, excitement, and overall quality. The novelty was a mirage — surface-level surprise that didn't survive contact with reality.

The Brynjolfsson, Li, and Raymond NBER study on customer support agents found a thirty-four percent productivity gain for novices and approximately zero for the most skilled agents. The mean comes up. The top doesn't move.

A specific framing I keep coming back to: the homogenizer compresses the experience curve. It pulls everyone toward a competent middle. For a novice that is a lift; for an expert it is a leash.

So what

Here is what I am actually claiming, after all of that.

The homogenizer is real. It is empirically measured on the actual products you and I use. Its mechanism is well-understood, and the math is published. The negative effect on outlier-requiring tasks is also empirically measured, although the precise causal link from homogenizer to specific failure is more inferred than directly proven.

What I find most interesting is that none of this is intrinsic to AI. The homogenizer is a stack of human choices — preference labels, KL coefficients, decoding parameters — each of which is modifiable. There are research teams actively working on all of them. The diverse-persona prompting work of Wan & Kalman (2025) shows that varying personas at inference time can preserve output diversity at roughly human-baseline levels — no retraining needed. The "centaur" and "cyborg" usage patterns from the BCG/Harvard study show that deliberate workflow design — humans supplying taste, AI supplying candidates, with clean handoffs in between — can invert the homogenization default.

So the homogenization is not a fate. It is a default.

Which raises the actually uncomfortable question, the one I am not sure how to answer: if the path of least resistance is the homogenized path, and most people will take the path of least resistance, what does it look like, ten years from now, when most of the writing and thinking and problem-solving in the world has been quietly compressed toward the mean?

I don't know. But I think it is worth asking before we find out.

References

The studies and papers referenced in this article, in the order they appear.

The opening — Galton and the regression-to-the-mean paper

Galton, F. (1886). Regression towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263. PDF
Keon, M., Karim, A., Lohana, B., Karim, A., Nguyen, T., Hamilton, T., & Abbas, A. (2025). Galton's Law of Mediocrity: Why Large Language Models Regress to the Mean and Fail at Creativity in Advertising. arXiv:2509.25767. Link

The mechanism — RLHF, mode collapse, and typicality bias

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. (2020). Learning to Summarize from Human Feedback. NeurIPS 2020. arXiv:2009.01325. Link
Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback ("InstructGPT"). arXiv:2203.02155. Link
Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., & Raileanu, R. (2023). Understanding the Effects of RLHF on LLM Generalisation and Diversity. arXiv:2310.06452. Link
GX-Chen, A., et al. (2025). KL-Regularized Reinforcement Learning is Designed to Mode Collapse. arXiv:2510.20817. Link
Zhang et al. (2025). Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity. OpenReview. Link — on typicality bias in human preferences.

The proof on commercial models

Padmakumar, V., & He, H. (2024). Does Writing with Language Models Reduce Content Diversity? ICLR 2024. arXiv:2309.05196. Link
Doshi, A. R., & Hauser, O. P. (2024). Generative AI Enhances Individual Creativity but Reduces the Collective Diversity of Novel Content. Science Advances. Link (preprint: arXiv:2312.00506)
Anderson, B. R., Shah, J. H., & Kreminski, M. (2024). Homogenization Effects of Large Language Models on Human Creative Ideation. ACM Creativity & Cognition '24. Link
Wenger, E., & Kenett, Y. (2025). We're Different, We're the Same: Creative Homogeneity Across LLMs. arXiv:2501.19361. Link

Where the homogenizer's harm shows up

METR (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arXiv:2507.09089. Paper · Blog post
Si, C., Yang, D., & Hashimoto, T. (2024). Can LLMs Generate Novel Research Ideas? OpenReview PDF
Si, C., Yang, D., & Hashimoto, T. (2026). The Ideation–Execution Gap: Execution Outcomes of LLM-Generated vs. Human Research Ideas. Stanford SALT Lab. Link
Brynjolfsson, E., Li, D., & Raymond, L. R. (2023). Generative AI at Work. NBER Working Paper 31161. Link

The "default, not a fate" counter — what we can do about it

Dell'Acqua, F., et al. (2023). Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. Harvard Business School Working Paper 24-013. Link
Mollick, E. (2023). Centaurs and Cyborgs on the Jagged Frontier. One Useful Thing. Link — readable companion piece on the centaur usage pattern.
Wan, Y., & Kalman, Y. M. (2025). Diverse AI Personas Can Mitigate the Homogenization Effect in Human-AI Collaborative Ideation. arXiv:2504.13868. Link

Command Palette