Coinspeaker
ChatGPT Is Declining in Performance According to Recent Research
ChatGPT has become one of the most popular and powerful AI tools. Users worldwide have found it useful for many functions, from brainstorming content ideas to solving complex mathematical equations. Despite its widespread use, some GPT-4 users have raised concerns that its performance is declining over time, and recent research also points to a decline on some tasks.
Researchers from Stanford University and the University of California, Berkeley, recently published a study titled “How Is ChatGPT’s Behavior Changing Over Time?” The study explores changes in the outputs of OpenAI’s large language models (LLMs), specifically GPT-3.5 and GPT-4, over the past few months.
The Results of the Research on OpenAI’s ChatGPT Models
The study calls into question GPT-4’s performance in coding and compositional tasks. Using API access, the researchers tested the March and June 2023 versions of these models on a variety of tasks, including math problem-solving, sensitive question answering, code generation, and visual reasoning. Notably, GPT-4’s accuracy at identifying prime numbers fell from 97.6% in March to only 2.4% in June. GPT-3.5, on the other hand, improved on this task over the same period.
For example, GPT-4's success rate on "is this number prime? think step by step" fell from 97.6% to 2.4% from March to June, while GPT-3.5 improved. Behavior on sensitive inputs also changed. Other tasks changed less, but there are definitely singificant changes in LLM behavior.
— Matei Zaharia (@matei_zaharia) July 19, 2023
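To make the prime-number figures concrete, here is a minimal Python sketch of how accuracy on a "is this number prime?" task could be scored. It is not the researchers' evaluation code. The helper names (`parse_yes_no`, `accuracy`), the rule that the last "yes" or "no" in a response counts as the verdict, and the sample responses are all assumptions made for illustration.

```python
from math import isqrt


def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def parse_yes_no(response: str):
    """Return True for a 'yes' verdict, False for 'no', None if neither appears.

    Assumption: the last 'yes' or 'no' word in the response is the verdict.
    """
    words = [w.strip(".,!?[]()\"'").lower() for w in response.split()]
    verdicts = [w for w in words if w in ("yes", "no")]
    if not verdicts:
        return None
    return verdicts[-1] == "yes"


def accuracy(responses: dict) -> float:
    """Fraction of numbers whose parsed verdict matches the ground truth."""
    correct = sum(parse_yes_no(text) == is_prime(n) for n, text in responses.items())
    return correct / len(responses)


# Hypothetical responses, invented for illustration only.
long_answers = {
    89: "Step 1: test divisors up to 9. None divide 89. Answer: [Yes]",
    91: "Step 1: 91 = 7 * 13, so it has a divisor. Answer: [No]",
    97: "Step 1: test divisors up to 9. None divide 97. Answer: [Yes]",
    87: "Step 1: 87 = 3 * 29, so it has a divisor. Answer: [No]",
}
short_answers = {89: "[No]", 91: "[No]", 97: "[No]", 87: "[No]"}

print(accuracy(long_answers), accuracy(short_answers))
```

Running the same scoring function against saved responses from two model snapshots is one simple way to spot the kind of shift the study describes.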
Another notable finding from the study was the considerable change in GPT-4’s response length. GPT-4’s average verbosity decreased sharply, from 821.2 characters in March to just 3.8 characters in June. In contrast, GPT-3.5’s response length grew by approximately 40% during the same period. Moreover, the study found that for both GPT-4 and GPT-3.5, the overlap between the March and June versions’ answers was relatively small.
The study observed distinct changes in how GPT-4 and GPT-3.5 responded to sensitive questions. From March to June, GPT-4’s frequency of answering such questions decreased significantly, dropping from 21.0% to 5.0%. On the other hand, GPT-3.5 demonstrated the opposite trend, with the rate of answering sensitive questions increasing from 2.0% to 8.0% during the same period.
The researchers speculated that the June update for ChatGPT (GPT-4) likely incorporated a stronger safety layer, leading to a more conservative approach to handling sensitive queries. In contrast, GPT-3.5 appeared to become less conservative in its responses to such questions.
The research findings emphasize that the behavior of a supposedly consistent LLM service can undergo significant changes in a relatively short period. This highlights the crucial importance of continuous monitoring to ensure and maintain LLM quality.
Critics of GPT-4 have expressed subjective concerns about its declining performance. Some theories suggest that OpenAI might have “distilled” the model to reduce computational overhead, fine-tuned it to minimize harmful outputs, or even intentionally limited its coding capabilities to drive demand for GitHub Copilot.
GPT-4 is getting worse over time, not better.
Many people have reported noticing a significant degradation in the quality of the model responses, but so far, it was all anecdotal.
But now we know.
At least one study shows how the June version of GPT-4 is objectively worse than… pic.twitter.com/whhELYY6M4
— Santiago (@svpino) July 19, 2023
OpenAI has consistently denied any decline in GPT-4’s capabilities. According to OpenAI’s VP of Product, Peter Welinder, each new version is designed to be smarter than the previous one, and issues may become more noticeable with increased usage.
No, we haven't made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one.
Current hypothesis: When you use it more heavily, you start noticing issues you didn't see before.
— Peter Welinder (@npew) July 13, 2023
The research paper challenges OpenAI’s assertion that each new version of GPT is an improvement on the previous one. One of the paper’s co-authors, Matei Zaharia, who is also the Chief Technology Officer at Databricks, expressed concerns on Twitter about the difficulty of managing the quality of AI model responses. He further questioned how well model developers can detect changes and prevent the loss of certain capabilities while introducing new ones.
While the study appears to support critics’ claims, some experts advise caution. Arvind Narayanan, a computer science professor at Princeton, claims that the study’s findings do not prove GPT-4’s decline conclusively. He speculates that the observed changes are consistent with OpenAI’s fine-tuning adjustments. The study, for example, evaluates code generation based on immediate executability rather than correctness, which could lead to misinterpretations.
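The gap between "immediately executable" and "correct" can be shown with a short Python sketch. It rests on an assumption not stated in the article: that a response might wrap otherwise valid code in markdown code fences, so the raw text fails to run even though the code inside is right. The function names and the `add` example are hypothetical.

```python
import re

# Three backticks, the markdown code-fence marker, built programmatically.
FENCE = "`" * 3


def is_directly_executable(text: str) -> bool:
    """True if the raw response parses as Python source.

    compile() only checks syntax; it does not run the code.
    """
    try:
        compile(text, "<response>", "exec")
        return True
    except SyntaxError:
        return False


def strip_fences(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    pattern = re.escape(FENCE) + r"[a-zA-Z]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else text


def is_correct(text: str) -> bool:
    """Run the extracted code and check a hypothetical `add` function.

    Note: exec() on untrusted model output is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(strip_fences(text), namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


bare = "def add(a, b):\n    return a + b\n"
fenced = f"{FENCE}python\n{bare}{FENCE}"

for name, response in (("bare", bare), ("fenced", fenced)):
    print(name, is_directly_executable(response), is_correct(response))
```

Under these assumptions, the fenced response fails the executability check but passes the correctness check, which is the kind of misinterpretation Narayanan warns about.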