Paper Reading AI Learner

Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts

2026-01-21 21:29:46
Sydney Anuyah, Sneha Shajee-Mohan, Ankit-Singh Chauhan, Sunandan Chakraborty

Abstract

The safe deployment of large language models (LLMs) in high-stakes fields such as biomedicine requires them to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, built on 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying whether a text contains a causal link) and 2) Causal Extraction (extracting the exact cause and effect phrases). We tested prompting methods ranging from simple instructions (zero-shot) to more complex strategies such as Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results reveal major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, achieved a mean score of only 49.57% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations; performance plummeted on more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. Code available here: this https URL
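The inter-annotator agreement figure ($\kappa \ge 0.758$) is a Cohen's kappa. As an illustration only (the function name and sample labels below are hypothetical, not from the paper), a minimal sketch of the statistic for two annotators' binary causal / not-causal labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    a, b -- equal-length sequences of labels (e.g. 1 = causal, 0 = not causal).
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under chance, from each annotator's label marginals.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)

# Toy example: annotators agree on 3 of 4 items.
print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1]))  # → 0.5
```

A kappa of 0.758 or above is conventionally read as substantial agreement, which is why the paper cites it as validation of the benchmark labels.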

URL

https://arxiv.org/abs/2601.15479

PDF

https://arxiv.org/pdf/2601.15479.pdf
