Building and supporting modern AI models requires significant investment, potentially exceeding hundreds of millions of dollars, and estimates indicate these costs may reach a billion dollars in the near future.
Most of this expenditure goes to computing power: training relies on hardware such as Nvidia GPUs, which cost roughly $30,000 each, and a single model may require thousands of them to train efficiently. Researchers have also stressed that the quality and quantity of the training data used to develop such models are just as important.
Industry leaders reveal staggering costs of AI development
According to James Betker of OpenAI, a model's performance is a function of its training data rather than of its design or architecture; his assertion is that different models trained on the same large dataset will converge to roughly the same results. By this reasoning, data is the key to the advancement of AI technology.
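Betker's claim echoes the broader scaling-law literature, in which model loss is described as a smooth function of dataset size. The sketch below is purely illustrative, assuming a Chinchilla-style power law with made-up constants; it is not a formula from the article or from OpenAI.

```python
# Illustrative only: a power-law model in which loss depends mainly on the
# amount of training data D. The constants are hypothetical placeholders,
# not values reported in the article.
def approx_loss(tokens: float, irreducible: float = 1.7,
                coeff: float = 410.0, exponent: float = 0.28) -> float:
    """Toy estimate of model loss as a function of training tokens."""
    return irreducible + coeff / (tokens ** exponent)

for tokens in (1e9, 1e10, 1e11, 1e12):
    print(f"{tokens:.0e} tokens -> approx loss {approx_loss(tokens):.2f}")
```

Under this assumed form, doubling the dataset reliably lowers loss, regardless of architectural details, which is the intuition behind "data is the key."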
Dario Amodei, CEO of the AI firm Anthropic, shared his insights on the financial side of these challenges on the In Good Company podcast. He said that training current models such as GPT-4 costs an estimated $100 million, and that training future models may require $10 billion to $100 billion within the next few years.
Generative AI models, including those built by large firms, are at their core statistical models: they learn from vast numbers of examples in order to predict the most probable outputs. Kyle Lo of the Allen Institute for AI (AI2) says that gains in performance can mostly be attributed to the data, especially when the training setup is held constant.
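As a toy illustration of what "statistical model" means here, the following sketch builds a bigram frequency table from a few example sentences and predicts the most probable next word. It is a deliberately simplified stand-in for the neural models the article discusses, not their actual mechanics.

```python
from collections import Counter, defaultdict

# Toy "statistical model": count which word follows which in some example
# text, then predict the most probable next word. Large generative models
# do something analogous with neural networks over billions of examples.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequently observed follower of `word`."""
    followers = bigram_counts.get(word)
    return followers.most_common(1)[0][0] if followers else "<unknown>"

print(predict_next("the"))  # 'cat' -- the most common continuation
print(predict_next("cat"))  # 'sat' (ties broken by first occurrence)
```

The more example text such a model sees, the better its frequency estimates become, which is the simplest version of Lo's point that performance gains largely come from data.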
Data centralization raises ethical and accessibility concerns
The high cost of obtaining good-quality data is making AI development the preserve of a few large companies in the developed world. This concentration of resources also raises concerns about who can access AI technology and about the potential for misuse.
OpenAI alone has spent hundreds of millions of dollars on data licenses, and Meta has considered purchasing publishers for data access. The AI training data market is expected to expand, and data brokers are likely to benefit from this opportunity.
Problems also arise from questionable data acquisition practices. According to reports, many companies have scraped large volumes of content without the authorization of its owners, and some collect data from various platforms without compensating the users who produced it. As we previously reported, OpenAI used its Whisper audio transcription model to transcribe more than a million hours of YouTube videos to fine-tune GPT-4.
Organizations work to create open-access AI training datasets
Amid the problems created by the data acquisition race, independent efforts are underway to make training datasets openly available. Organizations such as EleutherAI and Hugging Face are building large, publicly accessible datasets for AI development.
The Wall Street Journal recently highlighted two potential strategies for addressing data acquisition problems: synthetic data generation and curriculum learning. Synthetic data is produced by AI models themselves, while curriculum learning feeds models high-quality data in a structured order so that they can form the same connections from less data. However, both methods are still in development, and their efficacy has not yet been proven.
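To make the curriculum-learning idea concrete, here is a minimal sketch that orders a toy dataset from easiest to hardest before running a stand-in training step. The difficulty heuristic and the train_step stub are hypothetical placeholders, not the Journal's or any lab's actual method.

```python
# Hypothetical curriculum-learning sketch: present training examples in a
# structured order (simplest/cleanest first) instead of randomly. The
# difficulty heuristic and train_step stub are illustrative placeholders.
examples = [
    "A long, noisy paragraph scraped from a forum thread about many topics.",
    "Water boils at 100 degrees Celsius at sea level.",
    "The cat sat on the mat.",
    "An ambiguous fragment with missing context and inconsistent formatting",
]

def difficulty(text: str) -> float:
    """Toy heuristic: longer texts are treated as harder."""
    return len(text.split())

def train_step(text: str) -> None:
    """Stand-in for a real gradient update on one example."""
    print(f"training on: {text[:40]!r}")

# Curriculum: sort from easiest to hardest, then train in that order.
for example in sorted(examples, key=difficulty):
    train_step(example)
```

The hoped-for payoff, as described in the article, is that a careful ordering of high-quality data lets a model learn comparable patterns from fewer examples, easing the pressure to acquire ever-larger datasets.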