Elizabeth Gibney
nature.com
News: July 24, 2024
Training artificial intelligence (AI) models on AI-generated text quickly leads to the models churning out nonsense, a study has found. This cannibalistic phenomenon, termed model collapse, could halt the improvement of large language models (LLMs) as they run out of human-derived training data and as increasing amounts of AI-generated text pervade the Internet.
“The message is, we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, “things will always, provably, go wrong”, he says. The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI.
The researchers began by using an LLM to create Wikipedia-like entries, then trained new iterations of the model on text produced by its predecessor. As the AI-generated information — known as synthetic data — polluted the training set, the model’s outputs became gibberish. The ninth iteration of the model completed a Wikipedia-style article about English church towers with a treatise on the many colours of jackrabbit tails (see ‘AI gibberish’).
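The recursive setup is simple enough to reproduce in miniature. What follows is a minimal sketch, not the researchers' actual pipeline: it swaps the LLM for a tiny bigram word model and an illustrative seed sentence, but follows the same loop of training each generation only on text sampled from its predecessor. Successive generations tend to lose rare words and drift into repetitive loops.

```python
# Toy illustration of recursive training on synthetic data.
# NOT the study's code: the "model" is a bigram word table and the seed
# corpus is a placeholder sentence, chosen only to keep the loop runnable.
import random
from collections import defaultdict

def fit_bigram(words):
    """Record, for each word, the words observed to follow it."""
    model = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def sample(model, n_words, rng):
    """Generate n_words tokens by walking the bigram table."""
    word = rng.choice(list(model.keys()))
    out = [word]
    for _ in range(n_words - 1):
        followers = model.get(word)
        if not followers:                       # dead end: restart anywhere
            word = rng.choice(list(model.keys()))
        else:
            word = rng.choice(followers)
        out.append(word)
    return out

rng = random.Random(0)
# Placeholder standing in for the human-written training data.
corpus = ("the old church tower stands over the village green and the "
          "bells ring out across the quiet fields every sunday morning").split()

for generation in range(10):
    model = fit_bigram(corpus)
    # Each generation is trained only on text sampled from its predecessor,
    # so no new human-written text enters the loop after generation zero.
    corpus = sample(model, n_words=200, rng=rng)
    print(f"gen {generation}: vocabulary size = {len(set(corpus))}")
```

The bigram stand-in is deliberately crude; the point is only the data flow, in which every generation after the first sees nothing but its predecessor's output.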
Here are some thoughts:
This article highlights a concerning phenomenon known as model collapse, which occurs when artificial intelligence (AI) models are trained on text generated by other AI models. This recursive training leads to a degradation in the quality of outputs, ultimately resulting in nonsensical responses. Researchers demonstrated that as AI-generated content increasingly permeates the internet, the reliance on this synthetic data could stifle the advancement of large language models (LLMs) due to a lack of high-quality human-derived training data. The study revealed that even before complete collapse, models trained on AI-generated texts tend to forget less frequent information, which poses significant risks for fair representation of marginalized groups.
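That tail-forgetting effect has a simple statistical core: each generation is trained on a finite sample drawn from the previous model, so items with small probability are likely to receive zero counts and disappear from the next estimate. The toy simulation below makes that visible; the vocabulary size, sample size and starting distribution are all assumptions for illustration, not figures from the study.

```python
# Toy illustration (not from the study) of rare information vanishing when a
# distribution is repeatedly sampled and re-estimated from its own samples.
import numpy as np

rng = np.random.default_rng(0)
vocab = 1000                              # number of distinct "facts"/tokens (assumed)
p = rng.dirichlet(np.full(vocab, 0.1))    # heavy-tailed starting distribution (assumed)
tail = np.argsort(p)[: vocab // 10]       # the 10% rarest items at generation 0

for generation in range(10):
    counts = rng.multinomial(5000, p)     # "train" the next model on 5,000 samples
    p = counts / counts.sum()             # maximum-likelihood re-estimate
    surviving = np.count_nonzero(p[tail])
    print(f"gen {generation}: {surviving}/{len(tail)} of the rarest items survive")
```

Items that draw zero samples in any generation are gone for good, which is one intuition for why minority or infrequently discussed content is the first casualty.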
Said differently: AI garbage in, AI garbage out.