Paper Reading AI Learner

ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

2024-05-08 05:36:52
Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Abstract

Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations with aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human ratings with an LLM-generated rating sometimes maintained, but more often lowered, inter-annotator agreement across different settings and quality aspects, suggesting that LLM judgments are not always consistent with those of human raters. We further quantified this difference by correlating LLM-generated ratings with majority-voted human ratings for each quality aspect. With the best system, Spearman's rank correlation ranged from 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered using an LLM as an additional rater when human raters are scarce, and measured how well majority-voted labels from a limited human pool with an LLM as an additional rater correlate with the original gold labels. While GPT-4 improved the outcome when there were only two human raters, LLMs were neutral to detrimental in all other observed cases with three or more human raters. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation here: this https URL.
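
To make the evaluation setup concrete, below is a minimal, self-contained sketch (an illustration under assumptions, not code from the paper or the ACORN release): it majority-votes hypothetical per-aspect human ratings and then computes Spearman's rank correlation between those gold labels and LLM-assigned ratings, mirroring the comparison reported above. The rating scale, data, and helper names are assumed for illustration.

from collections import Counter
from scipy.stats import spearmanr

def majority_vote(ratings):
    # Most common rating among the per-rater integer ratings for one explanation.
    return Counter(ratings).most_common(1)[0][0]

# Hypothetical 1-5 ratings for a single quality aspect; rows are explanations,
# columns are human raters.
human_ratings = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
]
llm_ratings = [4, 3, 5, 1]  # one LLM-assigned rating per explanation

gold = [majority_vote(r) for r in human_ratings]
rho, p_value = spearmanr(gold, llm_ratings)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")

The limited-pool setting described above follows the same pattern: add the LLM rating to a reduced set of human ratings before majority voting, then correlate the result with the full-pool gold labels.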

Abstract (translated)

Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) are appealing for this purpose because of their consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations with aspect-wise quality ratings, and use it to examine how LLMs evaluate explanations. We observed that replacing one of the human ratings with an LLM-generated rating sometimes maintained, but more often lowered, inter-annotator agreement across different settings and quality aspects, indicating that LLM judgments are not always consistent with human raters. We further quantified this difference by correlating LLM-generated ratings with majority-voted human ratings for each quality aspect. With the best system, Spearman's rank correlation ranged from 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered using an LLM as an additional rater when human raters are scarce, measuring how well majority-voted labels from a limited human pool with an LLM as an additional rater correlate with the original gold labels. While GPT-4 improved the outcome when there were only two human raters, LLMs were neutral to detrimental in all other observed cases with three or more human raters. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation: this https URL.

URL

https://arxiv.org/abs/2405.04818

PDF

https://arxiv.org/pdf/2405.04818.pdf

