Abstract
LLM-as-a-Judge has been widely adopted as an evaluation method in benchmarks and as a source of supervised reward signals in model training. However, despite its strong performance in many domains, its potential issues remain under-explored, which undermines its reliability and limits the scope of its utility. We therefore identify 12 key potential biases and propose CALM, a new automated bias quantification framework that systematically quantifies and analyzes each type of bias in LLM-as-a-Judge through automated, principle-guided modifications. Our experiments cover multiple popular language models, and the results indicate that while advanced models achieve commendable overall performance, significant biases persist in certain specific tasks. The empirical results suggest that there is still room to improve the reliability of LLM-as-a-Judge. We further discuss the explicit and implicit influence of these biases and offer suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
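The abstract does not specify how CALM scores each bias; purely as an illustration of the perturb-and-compare idea behind principle-guided modification, the sketch below measures how often a judge's verdict survives one hypothetical perturbation (inflating an answer's length to probe verbosity bias). The names `judge`, `pad_with_filler`, and `robustness_rate` are assumptions, not the paper's API.

```python
# Illustrative sketch only, not the paper's implementation: probe a single bias
# type (here, verbosity) by applying a principle-guided modification to one
# response and checking whether the LLM judge's preference stays consistent.
from typing import Callable, List, Tuple

def pad_with_filler(answer: str) -> str:
    """Hypothetical modification: make an answer longer without adding substance."""
    return answer + " To elaborate further, this point holds quite generally and is widely acknowledged."

def robustness_rate(
    judge: Callable[[str, str, str], int],   # judge(question, answer_a, answer_b) -> 0 or 1 (winner index)
    pairs: List[Tuple[str, str, str]],       # (question, better_answer, worse_answer)
) -> float:
    """Fraction of cases where the judge's verdict is unchanged after the modification."""
    consistent = 0
    for question, better, worse in pairs:
        before = judge(question, better, worse)
        after = judge(question, better, pad_with_filler(worse))  # inflate the worse answer
        consistent += int(before == after)
    return consistent / len(pairs) if pairs else 0.0
```

A lower robustness rate under a given modification would indicate a stronger susceptibility to that bias; repeating this with a different modification per bias type is one plausible way such a framework could cover all 12 biases.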
URL
https://arxiv.org/abs/2410.02736