Abstract
Saliency-based post-hoc explainability methods are important tools for understanding increasingly complex NLP models. While these methods can reflect the model's reasoning, they may not align with human intuition, rendering the explanations implausible. In this work, we present a methodology for incorporating rationales, which are text annotations explaining human decisions, into text classification models. This incorporation enhances the plausibility of post-hoc explanations while preserving their faithfulness. Our approach is agnostic to model architectures and explainability methods. We introduce the rationales during model training by augmenting the standard cross-entropy loss with a novel loss function inspired by contrastive learning. By leveraging a multi-objective optimization algorithm, we explore the trade-off between the two loss functions and generate a Pareto-optimal frontier of models that balance performance and plausibility. Through extensive experiments involving diverse models, datasets, and explainability methods, we demonstrate that our approach significantly enhances the quality of model explanations while incurring only minor, and sometimes negligible, degradation in the original model's performance.
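The abstract does not spell out the exact form of the rationale loss, so the following is only a minimal sketch of the general idea: a standard cross-entropy term combined with a hypothetical contrastive-style term that pushes token-level saliency on human-annotated rationale tokens above that of the remaining tokens. The names `combined_loss`, `lam`, and `margin`, and the hinge formulation itself, are illustrative assumptions rather than the paper's actual objective (which is balanced against cross-entropy via a multi-objective optimizer rather than a fixed weight).

```python
# Sketch only: NOT the paper's formulation. Assumes a weighted sum of the two
# objectives and a margin-based contrastive-style rationale term.
import torch
import torch.nn.functional as F

def combined_loss(logits, labels, saliency, rationale_mask, lam=0.5, margin=0.1):
    """logits: (B, C) class scores; labels: (B,) gold classes;
    saliency: (B, T) per-token importance scores from the model/explainer;
    rationale_mask: (B, T) 1 for tokens humans marked as rationale, else 0."""
    # Standard classification objective.
    ce = F.cross_entropy(logits, labels)

    # Mean saliency over rationale tokens vs. non-rationale tokens.
    eps = 1e-8
    pos = (saliency * rationale_mask).sum(dim=1) / (rationale_mask.sum(dim=1) + eps)
    neg_mask = 1.0 - rationale_mask
    neg = (saliency * neg_mask).sum(dim=1) / (neg_mask.sum(dim=1) + eps)

    # Contrastive-style hinge: rationale tokens should be at least `margin`
    # more salient than the rest of the input.
    rationale_loss = F.relu(margin - (pos - neg)).mean()

    # The paper explores this trade-off with multi-objective optimization;
    # a fixed scalar `lam` is used here purely for illustration.
    return ce + lam * rationale_loss
```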
URL
https://arxiv.org/abs/2404.03098