Paper Reading AI Learner

Can Large Language Models Explain Themselves?

2024-01-15 19:39:15
Andreas Madsen, Sarath Chandar, Siva Reddy

Abstract

Instruction-tuned large language models (LLMs) excel at many tasks and will even provide explanations for their behavior. Since these models are directly accessible to the public, there is a risk that convincing but wrong explanations can lead to unsupported confidence in LLMs. Therefore, the interpretability-faithfulness of self-explanations is an important consideration for AI safety. Assessing the interpretability-faithfulness of these explanations, termed self-explanations, is challenging because the models are too complex for humans to annotate what constitutes a correct explanation. To address this, we propose employing self-consistency checks as a measure of faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been applied to LLMs' self-explanations. We apply self-consistency checks to three types of self-explanations: counterfactuals, importance measures, and redactions. Our work demonstrates that faithfulness is both task- and model-dependent; e.g., for sentiment classification, counterfactual explanations are more faithful for Llama2, importance measures for Mistral, and redaction for Falcon 40B. Finally, our findings are robust to prompt variations.
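The redaction-style self-consistency check described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the model interface (`predict`) and the toy classifier are hypothetical stand-ins for an LLM.

```python
# Sketch of a redaction-based self-consistency check for faithfulness:
# if the model claims certain words are important for its prediction,
# removing those words should change the prediction.
# `predict` is an assumed interface, not from the paper's code.

def redact(text: str, important_words: set[str], mask: str = "[REDACTED]") -> str:
    """Replace every word the model claimed was important with a mask token."""
    return " ".join(
        mask if word.strip(".,!?").lower() in important_words else word
        for word in text.split()
    )

def is_faithful(predict, text: str, important_words: set[str]) -> bool:
    """Consistent (faithful) if redacting the 'important' words changes
    the prediction; an unchanged prediction suggests the self-explanation
    did not reflect what the model actually relied on."""
    original = predict(text)
    redacted = predict(redact(text, important_words))
    return original != redacted

# Toy stand-in for an LLM sentiment classifier (illustration only).
def toy_predict(text: str) -> str:
    return "positive" if "great" in text.lower() else "negative"

print(is_faithful(toy_predict, "A great movie", {"great"}))  # True: prediction flips
print(is_faithful(toy_predict, "A great movie", {"movie"}))  # False: prediction unchanged
```

In the paper's setting, `predict` would be an instruction-tuned LLM prompted for classification, and `important_words` would come from the model's own self-explanation rather than being supplied by hand.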

URL

https://arxiv.org/abs/2401.07927

PDF

https://arxiv.org/pdf/2401.07927.pdf

