Abstract
Large Language Models (LLMs), such as the GPT-4 and LLaMA families, have demonstrated considerable success across diverse tasks, including multiple-choice questions (MCQs). However, these models exhibit a positional bias, and in particular a pronounced anchored bias in the GPT-2 family: during inference they consistently favour the first choice, 'A', in MCQs. This anchored bias undermines the integrity of GPT-2's decision-making, since it skews predictions toward the position rather than the content of the choices. In this study, we use a mechanistic interpretability approach to identify the internal modules within GPT-2 models responsible for this bias. We focus on the Multi-Layer Perceptron (MLP) layers and attention heads, applying the "logit lens" method to trace and modify the specific value vectors that contribute to the bias. By updating these vectors within the MLP layers and recalibrating attention patterns to neutralise the preference for the first choice 'A', we effectively mitigate the anchored bias. Our interventions not only correct the bias but also improve overall MCQ prediction accuracy for the GPT-2 family across various datasets. This work presents the first comprehensive mechanistic analysis of anchored bias in MCQs within GPT-2 models, introducing targeted, minimal-intervention strategies that significantly enhance GPT-2 robustness and accuracy on MCQs. Our code is available at this https URL.
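The "logit lens" mentioned above reads off intermediate predictions by projecting each layer's residual-stream state through the model's final LayerNorm and unembedding matrix. The following is a minimal sketch of that projection only, not the paper's code: the dimensions, random weights, and hidden states are toy stand-ins for GPT-2's, and the simplified LayerNorm omits learned scale and shift parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5  # toy sizes standing in for GPT-2's 768 / 50257

def layer_norm(h, eps=1e-5):
    # Simplified final LayerNorm (ln_f), without learned scale/shift.
    return (h - h.mean()) / np.sqrt(h.var() + eps)

# Toy unembedding matrix standing in for GPT-2's W_U.
W_U = rng.standard_normal((d_model, vocab))

# Hypothetical residual-stream states after each of three layers.
hidden_states = [rng.standard_normal(d_model) for _ in range(3)]

for layer, h in enumerate(hidden_states):
    # The logit-lens projection: normalise, then unembed.
    logits = layer_norm(h) @ W_U
    print(f"layer {layer}: top token id = {logits.argmax()}")
```

Applied to GPT-2 at the position where the answer letter is predicted, the same projection lets one compare the logits assigned to 'A' versus the other choice tokens layer by layer, which is how the biased value vectors can be located.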
URL
https://arxiv.org/abs/2405.03205