Paper Reading AI Learner

From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos

2023-12-09 03:16:09
Yin Chen, Jia Li, Shiguang Shan, Meng Wang, Richang Hong

Abstract

Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations, e.g., insufficient quantity and diversity of pose, occlusion, and illumination, as well as the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently shows much higher performance and can benefit from more abundant high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. To tackle these challenges, we introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. First, we build and train an image model for SFER, which incorporates only a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs). Then, we obtain our video model (i.e., S2D) for DFER by inserting Temporal-Modeling Adapters (TMAs) into the image model. MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector. The TMAs capture and model the relationships of dynamic changes in facial expressions, effectively extending the pre-trained image model to videos. Notably, MCPs and TMAs add only a small fraction of trainable parameters (less than +10%) to the original image model. Moreover, we present a novel Emotion-Anchors (i.e., reference samples for each emotion category) based Self-Distillation Loss to reduce the detrimental influence of ambiguous emotion labels, further enhancing our S2D. Experiments conducted on popular SFER and DFER datasets show that our method achieves state-of-the-art performance.
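The abstract does not give implementation details for the Temporal-Modeling Adapters. As a rough illustration of the general idea — a lightweight bottleneck module that mixes information across frames and adds it back to a frozen per-frame backbone as a residual — here is a hypothetical numpy sketch. The attention-style temporal mixing, the bottleneck shape, and all names here are assumptions for illustration, not the paper's actual TMA design:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_modeling_adapter(x, W_down, W_up, alpha=0.5):
    """Hypothetical sketch of a temporal adapter.

    x: (T, D) per-frame features from a frozen image backbone.
    Down-project each frame to a small bottleneck, mix information
    across time with a simple softmax attention over frames,
    up-project, and add the result back as a residual.
    """
    h = x @ W_down                        # (T, r) bottleneck features
    attn = h @ h.T / np.sqrt(h.shape[1])  # (T, T) frame-to-frame scores
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    h = attn @ h                          # temporal mixing across frames
    return x + alpha * (h @ W_up)         # residual keeps backbone output

T, D, r = 8, 64, 8                        # frames, feature dim, bottleneck dim
x = rng.standard_normal((T, D))
W_down = rng.standard_normal((D, r)) * 0.1
W_up = rng.standard_normal((r, D)) * 0.1
y = temporal_modeling_adapter(x, W_down, W_up)
print(y.shape)                            # (8, 64): same shape as the input

# Extra trainable parameters per adapter: D*r + r*D = 2*D*r,
# a small fraction of a ViT backbone's parameter count.
print(2 * D * r)                          # 1024
```

Because the adapter is purely residual, setting its weights (or `alpha`) to zero recovers the pre-trained image model exactly, which is what makes this style of extension parameter-efficient.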

URL

https://arxiv.org/abs/2312.05447

PDF

https://arxiv.org/pdf/2312.05447.pdf

