As the artificial intelligence (AI) industry continues its rapid growth, a looming challenge has surfaced: the scarcity of high-quality training data. This shortage threatens to impede the progress of AI models, particularly large language models, and could alter the trajectory of the AI revolution. This article explains why dwindling data resources are a cause for concern and outlines possible ways to address the issue.
High-quality data is the lifeblood of AI algorithms, underpinning their accuracy, performance, and overall quality. ChatGPT, for instance, was trained on roughly 570 gigabytes of text data, equivalent to approximately 300 billion words. Similarly, Stable Diffusion, the model behind AI image-generating apps such as Lensa (a counterpart to systems like DALL-E and Midjourney), was trained on the LAION-5B dataset containing 5.8 billion image-text pairs. Inadequate data leads to inaccuracies and subpar outputs.
Moreover, the quality of training data plays a pivotal role. Low-quality data, such as social media posts or blurry images, is readily available but insufficient for training high-performing AI models. Text scraped from social media platforms may be tainted by bias, prejudice, disinformation, or illegal content, which AI models can inadvertently replicate. Microsoft's attempt to train its Tay chatbot on Twitter content, for example, resulted in racist and misogynistic outputs.
To mitigate these risks, AI developers seek high-quality content from sources like books, online articles, scientific papers, Wikipedia, and carefully filtered web content. Even unconventional sources, such as romance novels from self-publishing site Smashwords, have been used to enhance conversational AI like Google Assistant.
Data supply vs. AI demand
While the AI industry has continuously scaled up the size of datasets used for training, the availability of online data is growing at a slower pace. Recent research suggests that at the current rate of AI training, high-quality text data could be exhausted before 2026. Low-quality language data may run out between 2030 and 2050, and low-quality image data between 2030 and 2060. These projections raise concerns about the potential bottleneck in AI development.
The stakes are high: PwC estimates that AI could contribute up to US$15.7 trillion to the global economy by 2030. A data shortage, however, threatens to slow the industry's development and its realization of that potential. Fortunately, AI developers have several ways to ease the squeeze.
Improved Data Efficiency: AI developers can enhance algorithms to make more efficient use of existing data. In the coming years, they may achieve high-performance AI systems with less data and reduced computational power. This approach would not only alleviate the data shortage but also contribute to reducing AI’s environmental footprint.
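To make the idea concrete, here is a minimal, illustrative sketch of one well-known data-efficiency trick, image augmentation, in which each existing training image is turned into several slightly altered copies so a small dataset goes further. It is a toy example in plain NumPy; the transformations, array shapes, and the augment helper are invented for illustration and are not tied to any system mentioned in this article.

```python
import numpy as np

def augment(image):
    """Return several variants of one training image.

    Each variant is a cheap transformation (flip, crop, noise) of the
    original, so a small dataset yields several times more examples.
    """
    h, w = image.shape[:2]
    return [
        image,                                 # original
        image[:, ::-1],                        # horizontal flip
        image[h // 10 : h - h // 10,           # central crop (keeps the middle 80%)
              w // 10 : w - w // 10],
        np.clip(image + np.random.normal(0, 5, image.shape), 0, 255),  # mild noise
    ]

# Toy usage: one fake 64x64 grayscale "image" becomes four training samples.
fake_image = np.random.randint(0, 256, size=(64, 64)).astype(float)
print(len(augment(fake_image)), "training samples from 1 original image")
```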
Synthetic Data Generation: Another solution is the use of AI to create synthetic data tailored to train specific AI models. Projects are already utilizing synthetic content from data-generating services like Mostly AI, and this approach is likely to become more commonplace in the future.
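As a rough illustration of the principle, the sketch below fits an extremely simple statistical model to a handful of made-up records and then samples new, synthetic rows from it. Real synthetic-data services such as Mostly AI use far more sophisticated generative models; the columns, numbers, and the multivariate-normal model here are assumptions chosen purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A tiny, made-up "real" dataset: hypothetical (age, income) rows.
real = np.array([
    [23, 41_000],
    [35, 58_000],
    [41, 73_000],
    [29, 52_000],
    [52, 91_000],
], dtype=float)

# Fit a deliberately simple generative model: a multivariate normal with
# the real data's mean and covariance. The idea generalizes: learn the
# data's distribution, then sample as many new rows as you need.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

synthetic = rng.multivariate_normal(mean, cov, size=1000)
print(synthetic[:3])  # new rows that mimic the real data's statistics
```

The appeal is that, once a good generative model exists, the supply of training examples it can produce is effectively unlimited.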
Exploration of Non-Free Data Sources: Developers are increasingly looking beyond freely available online resources. Valuable data held by large publishers and offline repositories, including millions of texts published before the internet era, could serve as new sources for AI projects. Recent moves by companies like News Corp to negotiate content deals, in which AI companies pay for training data, could help ensure fair compensation for content creators and redress the power imbalance between creators and AI firms.
The AI industry’s dependence on high-quality training data is evident, and the potential shortage of such data could pose challenges to its continued growth. While concerns exist, the situation may not be as dire as it appears. AI developers have avenues to improve data efficiency, create synthetic data, and explore non-free data sources. These strategies, coupled with the ongoing evolution of AI technologies, offer hope for mitigating the risks associated with data scarcity. As the AI revolution unfolds, adaptability and innovation will be key in navigating this emerging challenge and ensuring AI’s continued contribution to the global economy.