There’s a general consensus that large language models are going to get better and better. But what if this as good as it gets …before the snake eats its own tail?
The tails of the original content distribution disappear. Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions. We call this effect model collapse.
Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re about to fill the Internet with blah. This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale.