Abstract
Instruction-tuned large language models (LLMs) excel at many tasks and will even provide explanations for their behavior. Since these models are directly accessible to the public, there is a risk that convincing but wrong explanations can lead to unsupported confidence in LLMs. Therefore, the interpretability-faithfulness of self-explanations is an important consideration for AI Safety. Assessing the interpretability-faithfulness of these explanations, termed self-explanations, is challenging, as the models are too complex for humans to annotate what counts as a correct explanation. To address this, we propose employing self-consistency checks as a measure of faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without those words. While self-consistency checks are a common approach to faithfulness, they have not previously been applied to LLMs' self-explanations. We apply self-consistency checks to three types of self-explanations: counterfactuals, importance measures, and redactions. Our work demonstrates that faithfulness is both task- and model-dependent: for example, in sentiment classification, counterfactual explanations are more faithful for Llama2, importance measures for Mistral, and redactions for Falcon 40B. Finally, our findings are robust to prompt variations.
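The importance-measure check described above can be illustrated with a minimal toy sketch. This is not the paper's implementation: a keyword-based sentiment classifier stands in for the LLM, and `important_words` plays the role of the words the model's self-explanation claims to rely on. The check redacts those words and tests whether the prediction changes.

```python
# Toy self-consistency (faithfulness) check for an importance-measure
# self-explanation. A keyword classifier stands in for the LLM.

def classify(text: str) -> str:
    """Toy stand-in for an LLM sentiment prediction."""
    positive = {"great", "wonderful", "excellent"}
    return "positive" if any(w in positive for w in text.lower().split()) else "negative"

def redact(text: str, words: set[str]) -> str:
    """Replace the claimed-important words with a mask token."""
    return " ".join("[REDACTED]" if w.lower() in words else w
                    for w in text.split())

def is_faithful(text: str, important_words: set[str]) -> bool:
    """Consistent if removing the words the explanation calls
    important changes the prediction."""
    return classify(redact(text, important_words)) != classify(text)

review = "the movie was great and the acting excellent"
print(is_faithful(review, {"great", "excellent"}))  # True: prediction flips
print(is_faithful(review, {"acting"}))              # False: prediction unchanged
```

With a real LLM, `classify` would be a prompted prediction call, and an unfaithful explanation is one where the model still reproduces its original answer after the supposedly important words are removed.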
URL
https://arxiv.org/abs/2401.07927