Generative AI tools such as ChatGPT, Gemini, and Copilot can now produce impressively fluent sentences and paragraphs from simple text prompts. These tools were trained on vast amounts of human-written text scraped from the internet. But as they flood the internet with synthetic content, that content is increasingly being used to train future generations of AI. Researchers warn that if this cycle continues unchecked, it could have disastrous consequences.
A recent study in Nature by University of Oxford computer scientist Ilia Shumailov and colleagues warns of the risks of training large language models on their own generated data, a process that can lead to what the authors call model collapse. Model collapse does not mean that generative AI tools will stop working; rather, their responses drift further and further from the original training data. As the tools train on their own output, small errors accumulate, producing content that loses the nuance of diverse perspectives and eventually degenerates into nonsensical gibberish.
The research team experimented with a pretrained language model called OPT-125m, fine-tuning it on Wikipedia articles and then training each successive generation on data generated by the previous one. By the ninth generation, the model was spewing nonsense, underscoring how serious the consequences of training AI on its own responses can be. Big AI companies have mechanisms to guard against this type of collapse, but as more individuals use language model output to train their own AIs, the consequences could be significant if the issue is not addressed.
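For readers curious what such an experiment looks like in practice, the sketch below shows one rough way to set up generational fine-tuning in Python with the Hugging Face transformers library. The base checkpoint (facebook/opt-125m) matches the model the study used, but the placeholder corpus, prompts, hyperparameters, sampling settings, and the choice to restart from the base checkpoint each generation are illustrative assumptions, not the authors' actual recipe.

```python
# A rough sketch of generational fine-tuning: train on human text once, then
# repeatedly retrain on text sampled from the previous generation's model.
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

MODEL_NAME = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=256, padding="max_length")
    # Causal LM labels are the input ids; mask padding so it does not contribute to the loss.
    enc["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in ids]
        for ids in enc["input_ids"]
    ]
    return enc

def fine_tune(texts, output_dir):
    """Fine-tune a fresh copy of the base model on a list of raw text strings."""
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    ds = Dataset.from_dict({"text": texts}).map(tokenize, batched=True, remove_columns=["text"])
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=4, report_to="none")
    Trainer(model=model, args=args, train_dataset=ds).train()
    return model

def generate_corpus(model, prompts, n_tokens=200):
    """Sample text from a trained model; this becomes the next generation's training data."""
    model.eval()
    texts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=True, top_p=0.9)
        texts.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return texts

# Generation 0 trains on human-written text (the study used Wikipedia articles);
# every later generation trains only on text produced by the one before it.
human_texts = ["..."]          # placeholder: a real human-written corpus goes here
prompts = ["The history of"]   # placeholder prompts for sampling each generation's output

corpus = human_texts
for generation in range(10):
    model = fine_tune(corpus, output_dir=f"gen{generation}")
    corpus = generate_corpus(model, prompts)
    # Inspect corpus here: over successive generations its quality degrades.
```

In a loop like this, nothing filters or refreshes the training data, so whatever quirks each generation introduces are inherited and amplified by the next.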
Model collapse ultimately describes a decline in the quality of data flowing both into and out of generative AI tools. As chatbots have become more widespread in recent years, so have concerns about the quality of the data used to train them. Errors in training data can cause generative AI tools to hallucinate, producing plausible-sounding but incorrect content. If that erroneous content is then used to train later versions of a model, it can distort the learning process and eventually break the models.
AI researcher Leqi Liu explains that model collapse is a drift away from the original data used to train the models, which leads to a loss of information about low-probability events and marginalized groups. That drift can make AI-generated content amplify bias and sound increasingly homogeneous. To prevent bias and model breakdown, it is important to train on a combination of earlier human-generated text and AI-generated text, and to deliberately capture a diverse range of data, including data about minority groups and rare events; doing so minimizes the risk of model collapse.
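As a loose illustration of the mechanism Liu describes, the toy simulation below (not taken from the study) treats a "model" as nothing more than a Gaussian fitted to its training sample. Each generation is trained on data sampled from the previous generation's fit, optionally blended with a share of the original data. The sample sizes, generation counts, and mixing fraction are arbitrary choices made for illustration.

```python
# Toy illustration: recursive self-training shrinks a distribution's spread and wipes
# out its tails, while mixing in original data keeps the estimate anchored.
import numpy as np

rng = np.random.default_rng(0)
original = rng.normal(loc=0.0, scale=1.0, size=5_000)  # stand-in for human-written data

def train_generations(n_generations=200, human_fraction=0.0, sample_size=30):
    """Fit a Gaussian each generation, then sample the next generation's training data from it."""
    data = rng.choice(original, size=sample_size)
    spreads = []
    for _ in range(n_generations):
        mu, sigma = data.mean(), data.std()              # "train" the model on current data
        synthetic = rng.normal(mu, sigma, sample_size)   # the model generates new content
        n_human = int(human_fraction * sample_size)      # optionally keep original data in the mix
        human = rng.choice(original, size=n_human)
        data = np.concatenate([synthetic[: sample_size - n_human], human])
        spreads.append(sigma)
    return spreads

# Pure self-training tends to let the fitted spread drift toward zero over many generations,
# whereas the mixed run tends to stay near the original spread of 1.0.
print("0% human data: ", [round(s, 2) for s in train_generations(human_fraction=0.0)[::20]])
print("30% human data:", [round(s, 2) for s in train_generations(human_fraction=0.3)[::20]])
```

The same intuition carries over to language models: rare phrasings and minority viewpoints sit in the tails of the data distribution, and they are the first things a self-trained model stops reproducing unless genuine human-generated data stays in the mix.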
Overall, while companies that market AI tools are likely to catch and address model collapse early on, individuals working on smaller-scale models need to be aware of the risks. Maintaining a balance between human-generated and AI-generated text in training data, and ensuring that a diverse range of data is included, helps mitigate the risks of bias and nonsensical output. The AI community will need to remain vigilant about these challenges to ensure the responsible development and deployment of AI technologies.