In an era marked by the rapid evolution of AI technology, the dominance of giants like ChatGPT is being challenged as specialized AI chatbots gain traction. This shift promises to make AI chatbots more useful for specific industries and regions, but it also raises critical questions about training data, synthetic data, and the future of AI development.
The specialization of AI chatbots
As the AI landscape evolves, AI chatbots are becoming less generic and more specialized. The key to their enhanced utility lies in the data they are trained on. Traditional AI models like ChatGPT cast a wide net, absorbing vast quantities of data from books, web pages, and more. However, this broad approach is gradually giving way to a more focused selection of training data tailored to specific industries or regions.
This specialization trend offers significant advantages. AI chatbots trained on targeted datasets can provide more accurate and relevant responses to users. For instance, an AI chatbot designed for the healthcare industry can offer specialized medical advice, while one focused on a specific region can provide localized information and insights.
The shifting value of data
To comprehend the evolving AI landscape, it’s crucial to understand the changing value of data. Companies like Meta and Google have long profited from user data by selling targeted advertisements. However, the value of data to organizations like OpenAI, the developer of ChatGPT, is somewhat different. They view data as a means to teach AI systems to construct human-like language.
Consider a simple tweet: “The cat sat on the mat.” While this tweet may not be particularly valuable to advertisers, it serves as a valuable example of human language construction for AI developers. Large language models (LLMs) like GPT-4 are built using billions of such data points from platforms like Twitter, Reddit, and Wikipedia.
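To make this concrete, here is a minimal sketch (in Python, using a toy whitespace split rather than a real subword tokenizer) of how a sentence like the tweet above becomes next-word prediction examples, the raw material LLMs learn from. The code is illustrative only.

```python
# Minimal sketch: turning one sentence into next-word prediction examples, the
# basic training signal behind LLMs. Real systems use subword tokenizers and
# billions of documents; whitespace splitting is a toy stand-in.

sentence = "The cat sat on the mat."
tokens = sentence.split()

# Each prefix of the sentence becomes a context, and the next word becomes the
# target the model learns to predict.
training_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in training_examples:
    print(f"context: {' '.join(context):<24} -> predict: {target}")
```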
This shift in the value of data is also changing the business models of data-rich organizations. Platforms like X and Reddit now charge third parties for API access to their data, driving up the cost of data acquisition.
The emergence of synthetic data
As data acquisition costs soar, the AI community is exploring synthetic data as a solution. Synthetic data is produced by AI systems themselves: it mimics real training data but is generated by algorithms rather than collected from people and platforms, and it is then used to train more advanced models.
However, synthetic data presents challenges. It must strike a delicate balance—being different enough to teach models something new while remaining similar enough to be accurate. If synthetic data merely replicates existing information, it can hinder creativity and perpetuate biases.
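As a rough sketch of what such a pipeline can look like, the example below generates variants of a few seed sentences and keeps only those that land between “too similar” and “too different”. The generate_synthetic() function and its synonym table are invented placeholders for a call to a real generative model, and the overlap thresholds are arbitrary.

```python
# Rough sketch of a synthetic-data pipeline: generate variants of seed text,
# then filter out candidates that are near-duplicates or unrelated noise.
# generate_synthetic() is a placeholder for a call to a real generative model.

def generate_synthetic(seed: str) -> str:
    # Placeholder: a real pipeline would prompt an LLM to paraphrase the seed.
    # A tiny synonym table stands in so the demo runs without a model.
    synonyms = {"filed": "submitted", "reviewed": "reassessed", "annually": "every year"}
    return " ".join(synonyms.get(word, word) for word in seed.split())

def word_overlap(a: str, b: str) -> float:
    # Jaccard overlap of word sets: 1.0 means identical vocabulary, 0.0 disjoint.
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

seed_examples = [
    "Claims must be filed within 30 days of the incident.",
    "Premiums are reviewed annually at the policy renewal date.",
]

synthetic_corpus = []
for seed in seed_examples:
    candidate = generate_synthetic(seed)
    overlap = word_overlap(candidate, seed)
    # Keep candidates that differ enough to add something new but stay close
    # enough to remain plausible. The thresholds here are arbitrary.
    if 0.3 <= overlap <= 0.95:
        synthetic_corpus.append(candidate)

print(synthetic_corpus)
```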
Another concern is what’s called the “Hapsburg AI” problem. Training successive generations of AI models on synthetic data produced by earlier models could lead to a decline in system effectiveness, akin to inbreeding in the Hapsburg royal family. Some studies suggest this decline may already be visible in systems like ChatGPT.
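The worry can be illustrated with a toy simulation, loosely inspired by the model-collapse literature rather than measured from any real chatbot: each “generation” fits a simple statistical model to samples drawn from the previous generation’s fit, and with finite data the fitted spread tends to drift downward, losing diversity.

```python
# Toy illustration of generational degradation: each generation fits a Gaussian
# to a finite sample drawn from the previous generation's fit. The estimated
# spread tends to drift toward zero, a crude analogue of models losing
# diversity when trained on their own outputs. Not a measurement of any real system.

import numpy as np

rng = np.random.default_rng(0)
mean, std = 0.0, 1.0   # the "real data" distribution at generation 0
sample_size = 50       # finite training set per generation

for generation in range(1, 11):
    samples = rng.normal(mean, std, size=sample_size)  # data produced by the current model
    mean, std = samples.mean(), samples.std()          # refit the next model on that data
    print(f"generation {generation:2d}: fitted std = {std:.3f}")
```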
The importance of human feedback
One reason for ChatGPT’s success is its use of reinforcement learning from human feedback (RLHF), in which human raters assess its outputs for accuracy. As AI systems increasingly rely on synthetic data, the demand for human feedback to correct inaccuracies is likely to grow.
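The details of ChatGPT’s pipeline are not public, but the core of RLHF’s reward-modelling step can be sketched as follows: human raters pick the better of two answers, and a reward model is trained so that preferred answers score higher. The example record, the toy reward() heuristic, and the pairwise loss below are illustrative assumptions, not OpenAI’s actual code.

```python
# Sketch of the reward-modelling step behind RLHF: human raters compare two
# candidate answers per prompt, and a reward model is trained so that the
# preferred answer scores higher. reward() is a toy stand-in for a neural
# network scoring (prompt, answer) pairs.

import math

# Each record: a prompt, the answer the rater preferred, and the one they rejected.
preference_data = [
    {
        "prompt": "What does RLHF stand for?",
        "chosen": "Reinforcement learning from human feedback.",
        "rejected": "A type of database index.",
    },
]

def reward(answer: str) -> float:
    # Placeholder reward model; a real one is a learned network, not a keyword check.
    return 1.0 if "feedback" in answer.lower() else -1.0

def pairwise_loss(chosen_score: float, rejected_score: float) -> float:
    # -log(sigmoid(chosen - rejected)): small when the preferred answer outscores
    # the rejected one, large when the ordering is wrong.
    return -math.log(1.0 / (1.0 + math.exp(rejected_score - chosen_score)))

for record in preference_data:
    loss = pairwise_loss(reward(record["chosen"]), reward(record["rejected"]))
    print(f"prompt: {record['prompt']!r}  pairwise loss: {loss:.3f}")
```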
However, assessing factual accuracy, especially in specialized or technical domains, can be challenging. Inaccuracies in specialist topics may slip past human raters during RLHF, potentially degrading the quality of general-purpose LLMs.
The future of AI: Specialized little language models
These challenges are driving several emerging trends. Google engineers have indicated that third parties can recreate LLMs like GPT-3 or LaMDA. Many organizations are now building their own internal AI systems using specialized data, tailored to their unique objectives.
For example, the Japanese government is considering developing a Japan-centric version of ChatGPT to better represent the country. Companies like SAP are offering AI development capabilities that let organizations create bespoke versions of ChatGPT. Consultancies like McKinsey and KPMG are exploring training AI models for specific purposes, and open-source systems like GPT4All already exist.
The potential of little language models
In light of development challenges and potential regulatory hurdles for generic LLMs, the future of AI may be characterized by many specialized “little” language models. These models might be trained on less data than systems like GPT-4 but could benefit from focused RLHF feedback.
Employees with expert knowledge of their organization’s objectives can provide valuable feedback to specialized AI systems, compensating for the disadvantages of having less data. These developments signify a shift toward highly tailored AI solutions that cater to specific industries, regions, and purposes.
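As a rough sketch of how such a “little” model might be built, the code below fine-tunes a small open model on a handful of in-house sentences using the Hugging Face transformers library. The model name, the toy domain texts, and the hyperparameters are placeholders; a real project would add evaluation, far more data, and the expert feedback loop described above.

```python
# Sketch: fine-tuning a small open language model on in-house domain text.
# Model name, example texts, and hyperparameters are placeholders.

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "distilgpt2"  # any small causal language model would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style models have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical internal corpus: a few sentences standing in for company documents.
domain_texts = [
    "Claim forms must be submitted within 30 days of the incident.",
    "Policy renewals are processed on the first business day of each month.",
]
dataset = Dataset.from_dict({"text": domain_texts})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# With mlm=False the collator builds next-token labels from the inputs.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="little-language-model",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```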
The AI landscape is undergoing a transformation marked by the rise of specialized AI chatbots and the challenges posed by synthetic data. While giants like ChatGPT continue to dominate, the future of AI may indeed be characterized by many smaller, purpose-built language models designed to excel in specific domains. As this evolution unfolds, striking the right balance between data, synthetic data, and human feedback will be critical to ensuring the continued advancement of AI technology.