Paper Reading AI Learner

Large Language Models are Inconsistent and Biased Evaluators

2024-05-02 20:42:28
Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara

Abstract

The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work has mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators, as they: (1) exhibit familiarity bias, a preference for text with lower perplexity; (2) show skewed and biased rating distributions; and (3) experience anchoring effects in multi-attribute judgments. We also find that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over state-of-the-art LLM evaluators.

URL

https://arxiv.org/abs/2405.01724

PDF

https://arxiv.org/pdf/2405.01724.pdf
