
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study

2024-04-23 15:52:52
Raphael Poulain, Hamed Fayyaz, Rahmatollah Beheshti

Abstract

Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two growing concerns arise in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluate eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing both general-purpose and clinically-trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns, such as larger models not necessarily being less biased, and models fine-tuned on medical data not necessarily outperforming general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias: specific phrasing can shift the observed bias patterns, and reflection-type approaches (like Chain of Thought) can effectively reduce biased outcomes. Consistent with prior studies, we call for additional evaluation, scrutiny, and enhancement of LLMs used in clinical decision support applications.
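To make the red-teaming setup concrete, below is a minimal sketch (not the authors' code) of the core idea: holding a clinical vignette fixed while varying only the patient's protected attributes, then comparing answers under a direct prompt versus a Chain-of-Thought prompt. The vignette wording is an illustrative assumption rather than one of the paper's standardized vignettes, and `query_llm` is a hypothetical placeholder for any chat-completion API.

```python
from itertools import product

# Hypothetical stand-in for a real chat-completion API call;
# swap in an actual client before use.
def query_llm(prompt: str) -> str:
    return "yes"  # dummy response for illustration only

# Counterfactual vignette template: only the protected attributes vary,
# so any change in the model's answer is attributable to them.
VIGNETTE = (
    "Patient: a {age}-year-old {race} {sex} presenting with chest pain "
    "radiating to the left arm and shortness of breath.\n"
    "Question: Should this patient be referred for advanced cardiac "
    "testing? Answer yes or no."
)

DIRECT_SUFFIX = "\nAnswer:"
COT_SUFFIX = "\nThink step by step about the clinical evidence, then answer."

races = ["Black", "White", "Hispanic", "Asian"]
sexes = ["man", "woman"]

results = {}
for race, sex in product(races, sexes):
    prompt = VIGNETTE.format(age=55, race=race, sex=sex)
    results[(race, sex)] = {
        "direct": query_llm(prompt + DIRECT_SUFFIX),
        "cot": query_llm(prompt + COT_SUFFIX),
    }

# A simple bias signal: answers that differ across demographic groups
# for an otherwise identical vignette.
for strategy in ("direct", "cot"):
    answers = {group: out[strategy] for group, out in results.items()}
    consistent = len(set(answers.values())) == 1
    print(f"{strategy}: consistent across groups = {consistent}")
```

In the paper's actual evaluation, disparities are measured statistically over many standardized vignettes and eight models; the string-equality check above is only a toy illustration of the counterfactual comparison.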

URL

https://arxiv.org/abs/2404.15149

PDF

https://arxiv.org/pdf/2404.15149.pdf

