Abstract
Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at this https URL.
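To make the abstract's pipeline concrete, here is a minimal sketch of the core idea of certifying a sparsified attribution map with randomized smoothing: treat each pixel as a binary "important / not important" label (a segmentation view), take a majority vote over Gaussian-perturbed inputs, and derive a Cohen-style certified $\ell_2$ radius per pixel. The `attribute_fn` interface, the top-k sparsification rule, and all parameter values are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import beta, norm

def topk_mask(attribution, k):
    """Sparsify an attribution map: keep only the k highest-scoring pixels."""
    flat = attribution.ravel()
    threshold = np.partition(flat, -k)[-k]
    return (attribution >= threshold).astype(np.uint8)

def certify_attribution(image, attribute_fn, k=1000, sigma=0.25, n=200, alpha=0.001):
    """Per-pixel randomized-smoothing certificate for a sparsified attribution map.

    `attribute_fn(x)` stands in for any black-box attribution method and is assumed
    to return a map with the same spatial shape as `image` (hypothetical interface).
    """
    h, w = image.shape[-2:]
    votes = np.zeros((h, w), dtype=np.int64)
    for _ in range(n):
        # Recompute the sparsified attribution mask on a Gaussian-perturbed input.
        noisy = image + sigma * np.random.randn(*image.shape)
        votes += topk_mask(attribute_fn(noisy), k)

    # Smoothed decision: per-pixel majority vote over the n noisy samples.
    smoothed_mask = (votes > n // 2).astype(np.uint8)

    # One-sided Clopper-Pearson lower bound on the majority-label probability.
    top_count = np.maximum(votes, n - votes)
    p_lower = beta.ppf(alpha, top_count, n - top_count + 1)

    # Cohen-style certified L2 radius; pixels with p_lower <= 0.5 abstain (radius 0).
    radius = np.where(p_lower > 0.5, sigma * norm.ppf(p_lower), 0.0)
    return smoothed_mask, radius

# Toy usage: a random "image" and a trivial stand-in attribution (channel-summed magnitude).
img = np.random.rand(3, 224, 224).astype(np.float32)
mask, radius = certify_attribution(img, lambda x: np.abs(x).sum(0), k=1000)
```

Pixels whose certified radius is positive keep their importance label under any $\ell_2$ perturbation smaller than that radius; the remaining pixels abstain, mirroring the standard randomized-smoothing guarantee applied pixel-wise.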
URL
https://arxiv.org/abs/2506.15499