Abstract
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations, e.g., insufficient quantity and diversity of pose, occlusion, and illumination, as well as the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently achieves much higher performance and can benefit from more abundant, high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. To tackle these challenges, we introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and the dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. First, we build and train an image model for SFER, which incorporates only a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs). We then obtain our video model (i.e., S2D) for DFER by inserting Temporal-Modeling Adapters (TMAs) into the image model. The MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector, while the TMAs capture and model the relationships among dynamic changes in facial expressions, effectively extending the pre-trained image model to videos. Notably, the MCPs and TMAs add only a small fraction of trainable parameters (less than 10%) to the original image model. Moreover, we present a novel self-distillation loss based on Emotion-Anchors (i.e., reference samples for each emotion category) to reduce the detrimental influence of ambiguous emotion labels, further enhancing our S2D. Experiments on popular SFER and DFER datasets show that our method achieves state-of-the-art performance.
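The Emotion-Anchor-based self-distillation targets can be sketched as follows. This is a minimal pure-Python illustration, not the paper's exact formulation: the one-hot label is blended with a softmax over similarities between a sample's feature and one anchor (reference) feature per emotion class, so ambiguously labeled samples receive softened targets. The function names, the cosine-similarity choice, and the `alpha`/`tau` hyperparameters are assumptions for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def anchor_soft_target(feature, anchors, hard_label, alpha=0.5, tau=0.1):
    """Blend the one-hot label with an anchor-similarity distribution.

    feature: sample embedding; anchors: one reference embedding per
    emotion class; alpha: mixing weight; tau: temperature.
    (Hypothetical formulation, for illustration only.)
    """
    sims = [cosine(feature, a) / tau for a in anchors]
    soft = softmax(sims)
    onehot = [1.0 if i == hard_label else 0.0 for i in range(len(anchors))]
    return [(1 - alpha) * h + alpha * s for h, s in zip(onehot, soft)]

def cross_entropy(logits, target):
    # Train against the softened target instead of the hard label.
    logp = [math.log(p) for p in softmax(logits)]
    return -sum(t * lp for t, lp in zip(target, logp))

# Toy example: 3 emotion classes with unit-basis anchor embeddings.
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A sample labeled class 1 whose feature actually resembles anchor 0:
# the soft target shifts part of its mass toward class 0.
target = anchor_soft_target([0.9, 0.1, 0.0], anchors, hard_label=1)
```

The intuition is that an ambiguous sample close to another class's anchor should not be forced onto a hard one-hot target; the anchor-similarity term redistributes probability mass accordingly.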
URL
https://arxiv.org/abs/2312.05447