A new wave of startups is grappling with a looming crisis for the AI industry: the depletion of training data. Artificial intelligence, particularly large language models, relies heavily on data for training, but the supply of usable data is finite and running out. Companies have already tapped public posts, copyrighted materials, and even the entire internet to train their AI models. The industry is expected to hit this “data wall” by 2026, prompting startups to explore new solutions.
One approach is to create artificial data. Companies like Gretel offer synthetic data, which mimics real information but is generated by AI, giving companies that face data scarcity an alternative source. However, synthetic data has limitations: it can exacerbate biases and may lack the outliers found in real data. To mitigate these issues, Gretel requires customers to provide a portion of real data for comparison.
Another strategy to overcome the data wall involves human labor. Startups like Scale AI and Toloka employ large numbers of workers to clean and label existing data or create new data for AI training. Scale AI, valued at $14 billion, has a workforce of 200,000 human annotators, while Toloka crowdsources work to millions of people worldwide. These workers play a crucial role in improving the quality and relevance of data for AI models, but the approach has drawbacks: workers are often poorly paid, and their output requires oversight to ensure accuracy and authenticity.
Some researchers argue for a shift toward using less data, emphasizing efficiency in AI training. While large language models have dominated the industry, there is a growing trend toward smaller, specialized models that require less data. Startups like Mistral AI and Snorkel AI are developing compact, task-specific models tailored to the needs of businesses. By prioritizing the quality and specificity of data, these startups aim to improve model performance without relying on massive amounts of data.
As the AI industry confronts data scarcity, startups are developing new approaches to training AI models. From synthetic data generation to human data labeling and specialized model development, these companies are working toward a more efficient and sustainable AI ecosystem. With the data wall approaching, they are seeking solutions that balance the need for data with the importance of quality and specificity in AI training. As the industry continues to evolve, they are poised to play a key role in shaping the future of AI technology.