Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions

2025-04-15 16:37:32
Wang Bill Zhu, Tianqi Chen, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia

Abstract

Cancer patients are increasingly turning to large language models (LLMs) as a new form of internet search for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical contexts. In this paper, we first evaluate LLMs on cancer-related questions drawn from real patients, reviewed by three hematology oncology physicians. While responses are generally accurate, with GPT-4-Turbo scoring 4.13 out of 5, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-4o and Claude-3.5-Sonnet, among others -- corrects these false presuppositions more than 30% of the time. Even advanced medical agentic methods do not prevent LLMs from ignoring false presuppositions. These findings expose a critical gap in the clinical reliability of LLMs and underscore the need for more robust safeguards in medical AI systems.

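To make the evaluation protocol concrete, below is a minimal sketch (not the authors' released code) of how one could score whether a model corrects the false presupposition in a Cancer-Myth-style question, using a second LLM as a judge. The file name cancer_myth.json, the question/presupposition field names, the judge prompt, and the choice of gpt-4o are illustrative assumptions; only the OpenAI chat completions API is assumed.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, model: str = "gpt-4o") -> str:
    # Query the model with the patient question as-is.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def corrects_presupposition(question: str, presupposition: str, response: str) -> bool:
    # Ask a judge model whether the response explicitly corrects the
    # false presupposition; count only an explicit "yes".
    judge_prompt = (
        "A patient asked a question that contains a false presupposition.\n"
        f"Question: {question}\n"
        f"False presupposition: {presupposition}\n"
        f"Model response: {response}\n"
        "Does the response explicitly correct the false presupposition? "
        "Answer 'yes' or 'no'."
    )
    verdict = answer(judge_prompt)
    return verdict.strip().lower().startswith("yes")

# Each record is assumed to contain the question text and its false presupposition.
items = json.load(open("cancer_myth.json"))
corrected = sum(
    corrects_presupposition(it["question"], it["presupposition"], answer(it["question"]))
    for it in items
)
print(f"Correction rate: {corrected / len(items):.1%}")

A loop like this, repeated per model, would yield the per-model correction rates the abstract reports (none above 30% for frontier LLMs); the paper's actual scoring relies on expert physician verification rather than a single automated judge.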

URL

https://arxiv.org/abs/2504.11373

PDF

https://arxiv.org/pdf/2504.11373.pdf

