
Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues

2024-02-01 02:13:49
Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger

Abstract

While state-of-the-art facial expression recognition (FER) classifiers achieve a high level of accuracy, they lack interpretability, an important aspect for end-users. To recognize basic facial expressions, experts resort to a codebook that associates a set of spatial action units with each facial expression. In this paper, we follow the experts' footsteps and propose a learning strategy that explicitly incorporates spatial action unit (AU) cues into the classifier's training to build a deep interpretable model. In particular, using this AU codebook, the input image's expression label, and facial landmarks, a single action-unit heatmap is built to indicate the most discriminative regions of interest in the image w.r.t. the facial expression. We leverage this valuable spatial cue to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of the classifier to be correlated with the AU map. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the experts' decision process. This is achieved using only the image's expression class as supervision, without any extra manual annotations. Moreover, our method is generic: it can be applied to any CNN- or transformer-based deep classifier without architectural changes or significant additional training time. Our extensive evaluation on two public benchmark datasets, RAF-DB and AffectNet, shows that the proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifier that relies on Class Activation Mapping (CAM) methods, and we show that our training technique also improves CAM interpretability.
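To make the training idea concrete, below is a minimal PyTorch sketch of the two ingredients the abstract describes: building a single AU heatmap from facial landmarks and an expression-to-AUs codebook, and a composite loss that combines classification with a term aligning a layer's spatial attention to that heatmap. This is an illustration based only on the abstract, not the authors' released code; all names (EXPRESSION_TO_AUS, au_landmark_indices, lambda_align), the Gaussian heatmap construction, and the cosine-similarity alignment term are assumptions, and the paper's exact formulation may differ.

```python
# Sketch of AU-guided training as described in the abstract (assumptions noted above).
import torch
import torch.nn.functional as F

# Hypothetical codebook: basic expression -> discriminative action units.
EXPRESSION_TO_AUS = {
    "happiness": [6, 12],     # cheek raiser, lip corner puller
    "sadness":   [1, 4, 15],  # inner brow raiser, brow lowerer, lip corner depressor
}

def build_au_heatmap(landmarks, au_ids, au_landmark_indices, size=14, sigma=1.5):
    """Place a Gaussian at each landmark tied to an active AU and merge
    them into a single [size, size] heatmap.

    landmarks: (68, 2) tensor of (x, y) coords normalized to [0, 1].
    au_landmark_indices: assumed mapping AU id -> landmark indices.
    """
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, size), torch.linspace(0, 1, size), indexing="ij"
    )
    heatmap = torch.zeros(size, size)
    for au in au_ids:
        for idx in au_landmark_indices[au]:
            cx, cy = landmarks[idx]
            # Squared distance scaled to pixel units before the Gaussian.
            g = torch.exp(
                -(((xs - cx) ** 2 + (ys - cy) ** 2) * size ** 2) / (2 * sigma ** 2)
            )
            heatmap = torch.maximum(heatmap, g)  # union of per-AU blobs
    return heatmap

def composite_loss(logits, labels, layer_features, au_heatmaps, lambda_align=1.0):
    """Cross-entropy plus an alignment term correlating one layer's
    spatial attention with the AU heatmap (cosine similarity here as a
    plausible choice of correlation measure).

    layer_features: (B, C, H, W) activations of the chosen layer.
    au_heatmaps:    (B, H, W) target AU maps for each image.
    """
    cls_loss = F.cross_entropy(logits, labels)
    # Channel-wise mean as a simple spatial attention map.
    attention = layer_features.mean(dim=1)                   # (B, H, W)
    align = F.cosine_similarity(
        attention.flatten(1), au_heatmaps.flatten(1), dim=1  # (B,)
    ).mean()
    return cls_loss + lambda_align * (1.0 - align)
```

Note that the only supervision consumed here is the expression label: the AU heatmap is derived from it via the codebook and landmarks, which is consistent with the abstract's claim that no extra manual annotation is required.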


URL

https://arxiv.org/abs/2402.00281

PDF

https://arxiv.org/pdf/2402.00281.pdf

