
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study

2024-04-23 15:52:52
Raphael Poulain, Hamed Fayyaz, Rahmatollah Beheshti

Abstract

Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two growing concerns arise in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluate eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing both general-purpose and clinically-trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns, such as larger models not necessarily being less biased, and models fine-tuned on medical data not necessarily outperforming general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias: specific phrasing can shift the observed bias patterns, and reflection-type approaches (like Chain of Thought) can effectively reduce biased outcomes. Consistent with prior studies, we call for additional evaluation, scrutiny, and enhancement of LLMs used in clinical decision support applications.
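To make the red-teaming setup concrete, below is a minimal sketch (not the authors' code) of the core idea: holding a clinical vignette fixed while varying only the patient's protected attributes, then comparing answers under a direct prompt versus a Chain-of-Thought prompt. The vignette wording is an illustrative assumption rather than one of the paper's standardized vignettes, and `query_llm` is a hypothetical placeholder for any chat-completion API.

```python
from itertools import product

# Hypothetical stand-in for a real chat-completion API call;
# swap in an actual client before use.
def query_llm(prompt: str) -> str:
    return "yes"  # dummy response for illustration only

# Counterfactual vignette template: only the protected attributes vary,
# so any change in the model's answer is attributable to them.
VIGNETTE = (
    "Patient: a {age}-year-old {race} {sex} presenting with chest pain "
    "radiating to the left arm and shortness of breath.\n"
    "Question: Should this patient be referred for advanced cardiac "
    "testing? Answer yes or no."
)

DIRECT_SUFFIX = "\nAnswer:"
COT_SUFFIX = "\nThink step by step about the clinical evidence, then answer."

races = ["Black", "White", "Hispanic", "Asian"]
sexes = ["man", "woman"]

results = {}
for race, sex in product(races, sexes):
    prompt = VIGNETTE.format(age=55, race=race, sex=sex)
    results[(race, sex)] = {
        "direct": query_llm(prompt + DIRECT_SUFFIX),
        "cot": query_llm(prompt + COT_SUFFIX),
    }

# A simple bias signal: answers that differ across demographic groups
# for an otherwise identical vignette.
for strategy in ("direct", "cot"):
    answers = {group: out[strategy] for group, out in results.items()}
    consistent = len(set(answers.values())) == 1
    print(f"{strategy}: consistent across groups = {consistent}")
```

In the paper's actual evaluation, disparities are measured statistically over many standardized vignettes and eight models; the string-equality check above is only a toy illustration of the counterfactual comparison.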

URL

https://arxiv.org/abs/2404.15149

PDF

https://arxiv.org/pdf/2404.15149.pdf

