Abstract
The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work has mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias, a preference for text with lower perplexity, (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over the state-of-the-art LLM evaluators.
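A minimal sketch of how the familiarity-bias claim could be probed: score each candidate summary's perplexity under a reference language model and check whether the evaluator's ratings correlate with it. This is not the authors' code; the summaries, the `llm_scores` ratings, and the choice of GPT-2 as the perplexity model are illustrative assumptions standing in for SummEval-style data.

```python
# Hypothetical probe for familiarity bias: does an LLM evaluator prefer
# low-perplexity (more "familiar") text? Not the paper's implementation.
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean token NLL."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

# Hypothetical candidate summaries and the 1-5 ratings an LLM evaluator gave.
summaries = [
    "The mayor announced a new transit plan on Monday.",
    "Transit plan new mayor announce Monday city the.",
    "A new public transit plan was unveiled by the city's mayor.",
    "Monday saw announcements of plans regarding transit by officials.",
]
llm_scores = [5, 2, 5, 3]

ppls = [perplexity(s) for s in summaries]
# A strong negative rank correlation (high score <-> low perplexity) would be
# consistent with the familiarity bias the paper reports.
rho, p = spearmanr(llm_scores, ppls)
print(f"Spearman rho(score, perplexity) = {rho:.3f} (p = {p:.3f})")
```

On real data one would run this over the full set of system outputs and compare the score-perplexity correlation against the correlation with human ratings; a rank statistic is used here because evaluator ratings are ordinal.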
URL
https://arxiv.org/abs/2405.01724