
Large Language Model Diagnoses, Triages Without Introducing Biases

UCLA Health researchers determined that Generative Pre-trained Transformer 4 can accurately diagnose and triage patients without introducing racial and ethnic biases.

By Shania Kennedy

Researchers from University of California, Los Angeles (UCLA) Health have demonstrated that Generative Pre-trained Transformer 4 (GPT-4) can diagnose and triage various health conditions on par with board-certified physicians without introducing racial and ethnic biases, according to a recent study published in JMIR Medical Education.

GPT-4 is a type of conversational artificial intelligence (AI), also known as a large language model (LLM), designed to generate text outputs based on image and text inputs. The model ‘learns’ from publicly available data, such as internet data, to predict the next word or phrase in a body of text, a capability that can be used to respond to a variety of queries.
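
For readers unfamiliar with that mechanism, the minimal sketch below illustrates next-word prediction. GPT-4 itself is available only through OpenAI's hosted service, so the example uses the small, openly downloadable GPT-2 model as a stand-in; the prompt and setup are illustrative assumptions, not anything drawn from the study.

```python
# A minimal sketch of next-token prediction, the mechanism described above.
# GPT-2 is used here only as an openly available stand-in for GPT-4.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The patient was short of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every vocabulary token at each position

# Pick the single most likely continuation of the prompt
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))  # e.g. " breath"
```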

The research team noted that while LLMs like GPT-4 are becoming more common in healthcare settings, the ability of these tools to accurately diagnose and triage patients has not been widely assessed. Further, whether GPT-4’s recommendations contain racial and ethnic biases has not been well studied, the researchers indicated.

To remedy this, the research team set out to determine whether GPT-4 can accurately diagnose and triage health conditions, and whether the tool presents racial and ethnic biases in its decisions.

To do this, the researchers compared the performance of GPT-4 to that of three board-certified physicians. The LLM and the clinicians were presented with 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023.

From there, the AI and the physicians were tasked with identifying the most likely primary diagnosis and triage level: emergency, non-emergency, or self-care.

Independent reviewers evaluated each diagnosis as ‘correct’ or ‘incorrect,’ and physician diagnosis was defined as the consensus of the three clinicians. The researchers then assessed whether GPT-4’s performance varied by race and ethnicity by adding information about patient race and ethnicity to the clinical vignettes.
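
To make the setup concrete, the sketch below shows one way a clinical vignette could be posed to GPT-4 through the OpenAI Python client. The vignette text, system prompt, and model settings are illustrative assumptions; the prompts actually used in the study are not reproduced in this article.

```python
# Illustrative only: one way to ask GPT-4 for a primary diagnosis and triage level.
# The vignette, system prompt, and parameters below are assumptions, not the
# materials used in the UCLA study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vignette = (
    "A 67-year-old man reports one hour of crushing chest pain radiating to "
    "the left arm, with sweating and nausea."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "Given a clinical vignette, state the single most likely primary "
                "diagnosis and a triage level: emergency, non-emergency, or self-care."
            ),
        },
        {"role": "user", "content": vignette},
    ],
)

print(response.choices[0].message.content)
```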

The results showed that GPT-4 performed similarly to the clinicians without introducing biases.

Diagnostic accuracy was similar between the tool and the physicians, with 97.8 percent of GPT-4’s diagnoses and 91.1 percent of the physicians’ diagnoses rated correct. GPT-4 also provided appropriate reasoning for its recommendations in 97.8 percent of the clinical vignettes.

Triage performance was also comparable, with both GPT-4 and the clinicians selecting the appropriate triage level in 66.7 percent of vignettes.

GPT-4’s diagnostic performance did not vary significantly by patient race or ethnicity, even when this information was included in the clinical vignettes. The LLM’s triage accuracy was 62.2 percent for Black patients, 66.7 percent for White patients, 66.7 percent for Asian patients, and 62.2 percent for Hispanic patients.
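
Because the study used 45 vignettes, those percentages correspond to small whole-number counts. The quick check below, which assumes each subgroup figure was computed over all 45 vignettes (something the article implies but does not state outright), shows the gap between groups amounts to roughly two vignettes.

```python
# Quick arithmetic check of the reported triage accuracies, assuming each
# race/ethnicity figure was computed over all 45 vignettes.
vignettes = 45
reported = {"Black": 62.2, "White": 66.7, "Asian": 66.7, "Hispanic": 62.2}

for group, pct in reported.items():
    correct = round(vignettes * pct / 100)
    print(f"{group}: about {correct} of {vignettes} vignettes triaged correctly")
# 66.7 percent works out to roughly 30/45 and 62.2 percent to roughly 28/45,
# a difference of about two vignettes between groups.
```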

These findings led the researchers to conclude that GPT-4 can diagnose and triage health conditions in a manner comparable to board-certified physicians without introducing racial and ethnic biases, which may help health systems looking to leverage conversational AI.

“The findings from our study should be reassuring for patients, because they indicate that large language models like GPT-4 show promise in providing accurate medical diagnoses without introducing racial and ethnic biases,” said senior author Yusuke Tsugawa, MD, PhD, associate professor of medicine in the division of general internal medicine and health services research at the David Geffen School of Medicine at UCLA, in a press release. “However, it is also important for us to continuously monitor the performance and potential biases of these models as they may change over time depending on the information fed to them.”

The research team also noted that the study had multiple limitations. For example, the clinical vignettes provided only summary information, which the tool and the clinicians used to recommend diagnoses and triage levels. While the vignettes are based on real-world cases, the researchers cautioned that physicians typically have access to more detailed patient information in clinical practice.

Further, GPT-4’s responses are largely dependent on how queries are worded, and the tool could have ‘learned’ from vignettes used early in the study to improve its performance on those provided later.

Finally, the research team indicated that their findings may not be applicable to other conversational AI tools.