Abstract
Large language model (LLM) judges are often used alongside traditional, algorithm-based metrics for tasks such as summarization because they better capture semantic information, reason more effectively, and are more robust to paraphrasing. However, LLM judges exhibit biases, including length and order biases, and are vulnerable to various adversarial input prompts. While recent studies have examined these biases, few have analyzed them at a more granular level in relation to a well-defined overlap metric. In this work, we analyze LLM judge bias as a function of overlap with human-written responses in the summarization domain. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarity (measured by ROUGE and BLEU) between the judged summaries decreases; this pattern holds for all but one model tested and persists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlap, suggesting that LLM-as-a-judge in the summarization domain should rely on techniques beyond simple comparison.
URL
https://arxiv.org/abs/2602.07673