ChatGPT can reportedly pass medical exams, but new research suggests it would be unwise to rely on it for serious health assessments, such as deciding whether a patient with chest pain needs to be hospitalized.
ChatGPT is clever but fails at heart assessment
In research published in the journal PLOS ONE, ChatGPT returned inconsistent heart risk levels for identical patient data in a study involving thousands of simulated chest pain cases.
Dr. Thomas Heston, a researcher at Washington State University’s Elson S. Floyd College of Medicine and lead author of the study, said:
“ChatGPT was not acting in a consistent manner; given the exact same data, ChatGPT would give a score of low risk, then next time an intermediate risk, and occasionally it would go as far as giving a high risk.”
According to the researchers, the problem likely stems from the randomness built into the current version of the software, ChatGPT-4, which helps it vary its responses to mimic natural language. But that same randomness works poorly in healthcare, Heston says, and can be dangerous, because medical use cases demand a single, consistent answer.
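To make the point concrete, here is a minimal sketch of how that randomness shows up at the API level, assuming the OpenAI Python client; the prompt and the five-call loop are our illustration, not the study’s protocol. At the default temperature, repeated calls with identical input can produce different answers; at temperature=0 the output becomes nearly deterministic.

```python
# Minimal sketch using the OpenAI Python client (hypothetical prompt;
# not the study's actual protocol).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = ("Given these patient variables, classify cardiac risk "
          "as low, intermediate, or high: ...")

for temperature in (1.0, 0.0):
    answers = set()
    for _ in range(5):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,  # 0 = near-deterministic; higher = more varied
        )
        answers.add(response.choices[0].message.content.strip())
    print(f"temperature={temperature}: {len(answers)} distinct answer(s)")
```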
Chest pain is an everyday complaint in hospital emergency rooms, and doctors need to evaluate the urgency of a patient’s condition quickly.
The most serious cases are easy to identify from their symptoms, Dr. Heston said, but the trickier ones are the lower-risk patients, where doctors must decide whether someone is safe enough to be sent home with outpatient care or should be admitted.
Other systems prove more reliable
A neural network like ChatGPT, with billions of parameters trained on huge datasets, can assess an enormous number of variables in seconds, potentially letting it evaluate a complex scenario faster and in more detail than simpler tools.
Dr. Heston says medical professionals mostly use two models for heart risk assessment, HEART and TIMI. He favors these scores because they draw on a handful of variables, including age, health history, and symptoms, far fewer than ChatGPT can take in.
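For contrast, the HEART score is fully deterministic: the same inputs always produce the same score, which is exactly the consistency the study found ChatGPT lacked. The sketch below encodes its five published components (History, ECG, Age, Risk factors, Troponin); the function and field names are our own, and clinical decisions belong to validated tools, not this snippet.

```python
# Illustrative, deterministic HEART score calculator. Component scoring
# follows the published HEART score; names and input format are ours.

def heart_score(history: int, ecg: int, age: int,
                risk_factors: int, troponin_ratio: float) -> tuple[int, str]:
    """history, ecg: clinician-assigned 0-2 points; age in years;
    risk_factors: count of cardiac risk factors;
    troponin_ratio: measured troponin / upper normal limit."""
    age_pts = 2 if age >= 65 else 1 if age >= 45 else 0
    rf_pts = 2 if risk_factors >= 3 else 1 if risk_factors >= 1 else 0
    trop_pts = 2 if troponin_ratio > 3 else 1 if troponin_ratio > 1 else 0
    total = history + ecg + age_pts + rf_pts + trop_pts
    risk = "low" if total <= 3 else "moderate" if total <= 6 else "high"
    return total, risk

# Same inputs, same answer, every time:
print(heart_score(history=1, ecg=0, age=52, risk_factors=2, troponin_ratio=0.8))
# -> (3, 'low')
```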
For the study, Dr. Heston and his co-author, Dr. Lawrence Lewis of Washington University in St. Louis, used three datasets of 10,000 randomly simulated cases each. One dataset contained the five variables of the HEART scale; another, the seven variables of the TIMI scale; and the third, 44 randomly selected health variables.
On the first two datasets, ChatGPT’s risk assessments disagreed with the constant HEART and TIMI scores 45% to 48% of the time on individual simulated cases. On the third dataset, where the researchers ran the same cases multiple times, ChatGPT frequently disagreed with itself, returning different results for identical cases.
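A consistency check of this kind is simple to state in code. The sketch below is our own framing, not necessarily the paper’s metric; `classify` stands in for a model call like the API example above.

```python
def consistency_rate(cases, classify, runs: int = 4) -> float:
    """Fraction of cases for which `runs` repeated classifications
    of the exact same input all return the same risk label."""
    consistent = sum(
        1 for case in cases
        if len({classify(case) for _ in range(runs)}) == 1
    )
    return consistent / len(cases)

# With a deterministic classifier, agreement is always 100%:
print(consistency_rate(range(10), classify=lambda c: "low"))  # -> 1.0
```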
Despite the study’s unsatisfactory findings, Dr. Heston sees greater potential for generative AI in healthcare as the technology advances. Medical records could be uploaded to such systems, he suggests, so that in an emergency a doctor could ask ChatGPT to surface the most important facts about a patient. It could also be asked to generate several possible diagnoses along with the reasoning for each, helping doctors work through a problem.