Abstract
Large language models (LLMs) have shown remarkable adaptability to diverse tasks by leveraging context prompts containing instructions or minimal input-output examples. However, recent work has revealed that they also exhibit label bias -- an undesirable preference toward predicting certain answers over others. Still, detecting and measuring this bias reliably and at scale has remained relatively unexplored. In this study, we evaluate different approaches to quantifying label bias in a model's predictions, conducting a comprehensive investigation across 279 classification tasks and ten LLMs. Our investigation reveals substantial label bias in models both before and after debiasing attempts, and highlights the importance of outcome-based evaluation metrics, which have not previously been used in this context. We further propose a novel label bias calibration method tailored to few-shot prompting, which outperforms recent calibration approaches at both improving performance and mitigating label bias. Our results emphasize that label bias in the predictions of LLMs remains a barrier to their reliability.
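As a rough illustration of what "quantifying label bias" can mean, one simple family of metrics compares a model's average predicted label distribution against the uniform distribution over labels. The sketch below is a hypothetical example of such a metric (total variation distance), not the specific measures evaluated in the paper:

```python
def label_bias_score(probs):
    """Total variation distance between the model's mean predicted
    label distribution (averaged over examples) and the uniform
    distribution over labels. Illustrative only; the paper compares
    several different bias metrics.

    probs: list of per-example probability vectors over k labels.
    Returns a value in [0, 1]; 0 means no aggregate label preference.
    """
    n, k = len(probs), len(probs[0])
    # Average the predicted distribution over all examples.
    mean_dist = [sum(p[j] for p in probs) / n for j in range(k)]
    # Total variation distance to the uniform distribution 1/k.
    return 0.5 * sum(abs(m - 1.0 / k) for m in mean_dist)


# A model that almost always favors label 0 scores high;
# a model with balanced predictions scores near zero.
biased = [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05]]
balanced = [[0.5, 0.5], [0.4, 0.6], [0.6, 0.4]]
```

Outcome-based metrics, by contrast, would look at the distribution of the model's argmax predictions rather than its probabilities; the same comparison against uniform applies.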
URL
https://arxiv.org/abs/2405.02743