AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

2025-07-17 17:09:22
Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan

Abstract

We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
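The abstract reports that automated evaluation diverges from human assessment, but it does not say which agreement statistic AbGen-Eval uses. As a minimal sketch of the general idea, assuming rank correlation is the measure of interest, the snippet below computes Kendall's tau between human ratings and an LLM judge's ratings of the same ablation study designs. The function name `kendall_tau` and the score lists are hypothetical, made-up illustrations, not data or code from the paper.

```python
from itertools import combinations


def kendall_tau(human, auto):
    """Kendall tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(human) != len(auto) or len(human) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(human)), 2):
        # Positive product: both raters order the pair the same way.
        sign = (human[i] - human[j]) * (auto[i] - auto[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(human) * (len(human) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy, invented 1-5 ratings for five generated designs (NOT from the paper).
human_scores = [5, 4, 2, 3, 1]
judge_scores = [3, 5, 2, 4, 1]

tau = kendall_tau(human_scores, judge_scores)
print(f"Kendall tau between human and judge ratings: {tau:.2f}")
```

A value near 1 would mean the judge ranks designs much as the human experts do; values near 0 would indicate the kind of unreliability the abstract describes.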

URL

https://arxiv.org/abs/2507.13300

PDF

https://arxiv.org/pdf/2507.13300.pdf

