Paper Reading AI Learner

Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions

2024-05-06 07:10:09
Ruizhe Li, Yanjun Gao

Abstract

Large Language Models (LLMs), such as the GPT-4 and LLaMA families, have demonstrated considerable success across diverse tasks, including multiple-choice questions (MCQs). However, these models exhibit a positional bias, which in the GPT-2 family manifests as a particularly severe anchored bias: during inference they consistently favour the first choice, 'A', in MCQs. This anchored bias undermines the integrity of GPT-2's decision-making, since performance is skewed by the position rather than the content of the choices. In this study, we use a mechanistic interpretability approach to identify the internal modules within GPT-2 models responsible for this bias. We focus on the multi-layer perceptron (MLP) layers and attention heads, using the "logit lens" method to trace and modify the specific value vectors that contribute to the bias. By updating these vectors within the MLP layers and recalibrating attention patterns to neutralise the preference for the first choice 'A', we effectively mitigate the anchored bias. Our interventions not only correct the bias but also improve overall MCQ prediction accuracy for the GPT-2 family across various datasets. This work presents the first comprehensive mechanistic analysis of anchored bias in MCQs within GPT-2 models, introducing targeted, minimal-intervention strategies that significantly enhance GPT-2's robustness and accuracy on MCQs. Our code is available at this https URL.
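The "logit lens" the abstract refers to can be illustrated with a minimal sketch: each layer's residual-stream state is projected through the model's unembedding matrix to read off which token the model "prefers" at that depth. The toy random model below stands in for GPT-2; all sizes and names (`d_model`, `W_U`, `layer`) are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 5, 4      # toy sizes, not GPT-2's real dimensions
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix (hidden -> vocab logits)

def layer(h):
    """Stand-in for one transformer block: a residual update of the hidden state."""
    W = rng.normal(scale=0.1, size=(d_model, d_model))
    return h + np.tanh(h @ W)

def logit_lens(h0):
    """Project each intermediate residual-stream state through W_U,
    returning one logit vector per layer."""
    h, per_layer = h0, []
    for _ in range(n_layers):
        h = layer(h)
        per_layer.append(h @ W_U)  # "lens": decode the state mid-network
    return per_layer

h0 = rng.normal(size=d_model)            # initial embedding of the last position
lenses = logit_lens(h0)
for l, logits in enumerate(lenses):
    print(f"layer {l}: argmax token = {np.argmax(logits)}")
```

Tracing how the logit of the token 'A' grows across layers in this way is what lets one localise the MLP value vectors responsible for the anchored bias before editing them.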

URL

https://arxiv.org/abs/2405.03205

PDF

https://arxiv.org/pdf/2405.03205.pdf

