Healthcare chatbots may promote racist misinformation
Every model tested demonstrated instances of racist tropes or repeated unsubstantiated claims about race, the study says.
Chatbots based on large language models (LLMs) are being integrated into healthcare systems, but these models may be perpetuating race-based medical beliefs that could be particularly harmful to Black patients, according to a new study published in npj Digital Medicine, a Nature Portfolio journal.
The researchers posed nine different questions to each of four large language models, running each question five times for a total of 45 responses per model. Every model perpetuated race-based medicine in at least some of its responses.
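For context, the repeated-query design is simple enough to outline in code. Below is a minimal sketch of that evaluation loop in Python; the model names, question wording and query_model() helper are placeholders introduced here for illustration, not details taken from the study.

```python
# Minimal sketch of the study's repeated-query design: four models, nine questions,
# five runs per question, 45 responses per model. Model names, question text and the
# query_model() client are illustrative placeholders, not details from the paper.
from collections import defaultdict

MODELS = ["model_a", "model_b", "model_c", "model_d"]        # hypothetical model identifiers
QUESTIONS = [f"medical_question_{i}" for i in range(1, 10)]  # nine prompts (placeholders)
RUNS_PER_QUESTION = 5

def query_model(model: str, prompt: str) -> str:
    """Stand-in for a chat-completion call; swap in the real client for each model."""
    return f"[{model} response to: {prompt}]"

responses = defaultdict(list)
for model in MODELS:
    for question in QUESTIONS:
        for _ in range(RUNS_PER_QUESTION):   # repeated runs capture run-to-run variability
            responses[model].append((question, query_model(model, question)))

# Each model yields 9 questions x 5 runs = 45 responses for manual review.
assert all(len(r) == 45 for r in responses.values())
```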
LLMs are being proposed for use in healthcare settings, with some models already connected to electronic health record systems. But based on the new findings, these LLMs could cause harm by lending credibility to debunked, racist ideas.
WHAT'S THE IMPACT?
Recent studies using LLMs have demonstrated their utility in answering medically relevant questions in specialties such as cardiology, anesthesiology and oncology. LLMs are trained on large corpora of text data and are engineered to produce human-like responses. Some models, such as Bard, can access the internet.
But the training data used to build the models are not transparent, which can introduce bias. Examples of such bias include race-based equations for estimating kidney function and lung capacity, which were built on incorrect, racist assumptions.
One 2016 study showed medical students and residents harbored incorrect beliefs about differences between white and Black patients on matters such as skin thickness, pain tolerance and brain size. Those beliefs influenced how the trainees reported they would manage patients.
Every LLM put under the microscope demonstrated instances of racist tropes or repeated unsubstantiated claims about race.
Because these models are trained in an unsupervised fashion on large-scale corpora drawn from the internet and textbooks, and do not assess research quality, they may incorporate older, biased or inaccurate information, the authors determined. Dataset bias, in turn, can influence model performance. Many LLMs include a training step called reinforcement learning from human feedback (RLHF), in which humans grade the model's responses. It's possible this step corrected some model outputs, particularly on sensitive questions with well-known online misinformation, such as the relationship between race and genetics.
Most of the models appear to rely on older race-based equations for kidney and lung function, which is concerning, since race-based equations lead to worse outcomes for Black patients. In the case of kidney function, the race-based answer appears regardless of whether race is mentioned in the prompt, while for lung capacity, the concerning responses appear only when race is mentioned.
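To make the kidney-function example concrete: older versions of the CKD-EPI creatinine equation multiplied the estimate upward for Black patients, a coefficient the 2021 refit removed. The Python sketch below contrasts the two published equations for background; it is not code from the study and is not intended as a clinical calculator.

```python
# Background sketch: the 2009 CKD-EPI creatinine equation applied a 1.159 multiplier
# for Black patients; the 2021 refit removed race entirely. Coefficients follow the
# published equations, but this is illustrative code, not a clinical tool.

def egfr_ckd_epi_2009(scr_mg_dl: float, age: int, female: bool, black: bool) -> float:
    """2009 CKD-EPI eGFR (mL/min/1.73 m^2), including the race coefficient."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    egfr = (141
            * min(scr_mg_dl / kappa, 1) ** alpha
            * max(scr_mg_dl / kappa, 1) ** -1.209
            * 0.993 ** age)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159   # race-based adjustment, now abandoned
    return egfr

def egfr_ckd_epi_2021(scr_mg_dl: float, age: int, female: bool) -> float:
    """2021 CKD-EPI eGFR refit, with no race term."""
    kappa = 0.7 if female else 0.9
    alpha = -0.241 if female else -0.302
    egfr = (142
            * min(scr_mg_dl / kappa, 1) ** alpha
            * max(scr_mg_dl / kappa, 1) ** -1.200
            * 0.9938 ** age)
    if female:
        egfr *= 1.012
    return egfr

# Same labs, same patient: the 2009 race multiplier inflates the estimate by ~16%
# for a Black patient, which can delay referral for specialist care or transplant.
print(egfr_ckd_epi_2009(1.2, 55, female=False, black=True))
print(egfr_ckd_epi_2021(1.2, 55, female=False))
```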
Models also perpetuate false conclusions about racial differences on topics such as skin thickness and pain threshold, the authors said.
THE LARGER TREND
The results suggest LLMs require further adjustment to fully eliminate inaccurate, race-based themes. Because of the potential for harm, the researchers concluded the models are not yet ready for clinical use or integration.
They urge medical centers and clinicians to exercise extreme caution when using LLMs for medical decision-making: the models require further evaluation, greater transparency and assessment for potential biases before they can be safely used in medical education, clinical decision-making or patient care.
Twitter: @JELagasse
Email the writer: Jeff.Lagasse@himssmedia.com