Causal Evaluation of Language Models

2024-05-01 16:43:21
Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu

Abstract

Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. First, we propose the CaLM framework, which establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results). This taxonomy defines a broad evaluation design space while systematically selecting criteria and priorities. Second, we compose the CaLM dataset, comprising 126,334 data samples, to provide curated sets of causal targets, adaptations, metrics, and errors, offering extensive coverage for diverse research pursuits. Third, we conduct an extensive evaluation of 28 leading language models on a core set of 92 causal targets, 9 adaptations, 7 metrics, and 12 error types. Fourth, we perform detailed analyses of the evaluation results across various dimensions (e.g., adaptation, scale). Fifth, we present 50 high-level empirical findings across 9 dimensions (e.g., model), providing valuable guidance for future language model development. Finally, we develop a multifaceted platform, including a website, leaderboards, datasets, and toolkits, to support scalable and adaptable assessments. We envision CaLM as an ever-evolving benchmark for the community, systematically updated with new causal targets, adaptations, models, metrics, and error types to reflect ongoing research advancements. Project website is at this https URL.
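
As a reading aid, the four-module taxonomy described above can be sketched as a small data structure. This is only an illustrative sketch, not the CaLM toolkit's actual API: the names `EvalConfig`, `CORE_COUNTS`, and `design_space_upper_bound`, and the placeholder values, are hypothetical. The counts are the ones quoted in the abstract. The product computed here assumes every model is paired with every causal target and every adaptation, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Optional

# Core-set sizes quoted in the abstract.
CORE_COUNTS = {
    "models": 28,
    "causal_targets": 92,
    "adaptations": 9,
    "metrics": 7,
    "error_types": 12,
}


@dataclass(frozen=True)
class EvalConfig:
    """One point in the evaluation design space (hypothetical structure)."""
    causal_target: str                 # what to evaluate
    adaptation: str                    # how to obtain the results
    metric: str                        # how to measure the results
    error_type: Optional[str] = None   # how to analyze the bad results


def design_space_upper_bound(counts: dict) -> int:
    """Upper bound on (model, target, adaptation) runs.

    Assumption: a full cross product. The real benchmark may evaluate
    only a subset of these combinations.
    """
    return counts["models"] * counts["causal_targets"] * counts["adaptations"]


if __name__ == "__main__":
    # Placeholder values, not names taken from the benchmark.
    cfg = EvalConfig(
        causal_target="example_target",
        adaptation="example_adaptation",
        metric="example_metric",
    )
    print(cfg)
    print(design_space_upper_bound(CORE_COUNTS))
```

Under the full-cross-product assumption, the bound is 28 × 92 × 9 runs; metrics and error types are applied to the outputs of those runs rather than multiplying them.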

URL

https://arxiv.org/abs/2405.00622

PDF

https://arxiv.org/pdf/2405.00622.pdf

