Paper Reading AI Learner

Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

2023-03-23 14:35:37
Avi Schwarzschild, Max Cembalest, Karthik Rao, Keegan Hines, John Dickerson

Abstract

As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner is a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods that give each feature in an input a score corresponding to its influence on a model's output. A major limitation of this family of explainers in practice is that they can disagree on which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We do this by introducing a Post hoc Explainer Agreement Regularization (PEAR) loss term: alongside the standard term corresponding to accuracy, this additional term measures the difference in feature attribution between a pair of explainers. We observe on three datasets that a model trained with this loss term shows improved explanation consensus on unseen data, and that consensus also improves between explainers other than those used in the loss term. We examine the trade-off between improved consensus and model performance. Finally, we study the influence our method has on feature attribution explanations.
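
To make the idea of an agreement term concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy logistic model, two simple explainers (plain gradient and input-times-gradient), a disagreement measure of one minus Pearson correlation, and an additive weight `lam`. None of these choices are specified in the abstract, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Toy logistic model (assumption): p = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def grad_attribution(w, b, x):
    """Explainer 1 (assumed): gradient of the output with respect to each input."""
    p = predict(w, b, x)
    return [p * (1.0 - p) * wi for wi in w]


def input_x_grad_attribution(w, b, x):
    """Explainer 2 (assumed): gradient scaled elementwise by the input."""
    return [g * xi for g, xi in zip(grad_attribution(w, b, x), x)]


def pearson(u, v):
    """Pearson correlation between two equal-length attribution vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (c - mean_v) for a, c in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((c - mean_v) ** 2 for c in v)
    denom = math.sqrt(var_u * var_v)
    # Treat a constant attribution vector as uncorrelated.
    return cov / denom if denom > 0 else 0.0


def disagreement(u, v):
    """0 when the two attributions are perfectly correlated, 2 when opposite."""
    return 1.0 - pearson(u, v)


def consensus_loss(w, b, x, y, lam=0.5):
    """Task loss plus a weighted explainer-disagreement penalty.

    `lam` is a hypothetical trade-off weight; the additive form is an assumption.
    """
    p = predict(w, b, x)
    task = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # binary cross-entropy
    penalty = disagreement(
        grad_attribution(w, b, x), input_x_grad_attribution(w, b, x)
    )
    return task + lam * penalty
```

Setting `lam=0` recovers the ordinary task loss, while a larger `lam` puts more weight on explainer agreement, which is the trade-off between consensus and model performance that the abstract mentions. Training against such a loss would additionally need a differentiable framework, which this sketch leaves out.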

URL

https://arxiv.org/abs/2303.13299

PDF

https://arxiv.org/pdf/2303.13299.pdf

