SimFLE: Simple Facial Landmark Encoding for Self-Supervised Facial Expression Recognition in the Wild

2023-03-14 06:30:55
Jiyong Moon, Seongsik Park

Abstract

One of the key issues in facial expression recognition in the wild (FER-W) is that curating large-scale labeled facial images is challenging due to the inherent complexity and ambiguity of facial images. Therefore, in this paper, we propose a self-supervised simple facial landmark encoding (SimFLE) method that can learn an effective encoding of facial landmarks, which are important features for improving the performance of FER-W, without expensive labels. Specifically, we introduce a novel FaceMAE module for this purpose. FaceMAE reconstructs masked facial images with elaborately designed semantic masking. Unlike previous random masking, semantic masking is conducted based on channel information processed in the backbone, so the rich semantics of the channels can be explored. Additionally, the semantic masking process is fully trainable, enabling FaceMAE to guide the backbone to learn the spatial details and contextual properties of fine-grained facial landmarks. Experimental results on several FER-W benchmarks show that the proposed SimFLE achieves superior facial landmark localization and noticeably improved performance compared to the supervised baseline and other self-supervised methods.
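The abstract describes an MAE-style reconstruction objective in which masking is driven by channel information from the backbone and the masking step itself is trainable. The sketch below is a minimal, hypothetical interpretation of that idea, not the authors' implementation: the module names (SemanticMasking, FaceMAESketch), the 1x1-conv scoring head, the straight-through top-k masking, and the light convolutional decoder are all assumptions introduced only to make the described mechanism concrete.

```python
# Hedged sketch of channel-guided "semantic masking" with an MAE-style
# reconstruction loss. All names and design choices here are illustrative
# assumptions based on the abstract, not the paper's released code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticMasking(nn.Module):
    """Scores spatial locations from backbone channel information and masks the
    top-scoring ones. The scoring head is trainable, so the reconstruction loss
    can shape which regions are treated as semantically important."""

    def __init__(self, in_channels: int, mask_ratio: float = 0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # 1x1 conv aggregates channel information into a per-location score.
        self.score_head = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) feature map from the backbone
        b, _, h, w = feats.shape
        probs = torch.sigmoid(self.score_head(feats).flatten(1))  # (B, H*W)
        num_mask = int(self.mask_ratio * h * w)
        topk = probs.topk(num_mask, dim=1).indices
        hard = torch.zeros_like(probs).scatter_(1, topk, 1.0)
        # Straight-through estimator keeps the scoring head differentiable.
        mask = hard + probs - probs.detach()
        return mask.view(b, 1, h, w)  # 1 = masked location


class FaceMAESketch(nn.Module):
    """Minimal pipeline: semantically mask backbone features, then reconstruct
    the input image from the visible features with a light decoder."""

    def __init__(self, backbone: nn.Module, feat_channels: int, mask_ratio: float = 0.5):
        super().__init__()
        self.backbone = backbone
        self.masking = SemanticMasking(feat_channels, mask_ratio)
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 3, 3, padding=1),
        )

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)                  # (B, C, H', W')
        mask = self.masking(feats)                     # (B, 1, H', W')
        recon = self.decoder(feats * (1.0 - mask))     # decode from visible features
        recon = F.interpolate(recon, size=images.shape[-2:],
                              mode="bilinear", align_corners=False)
        pixel_mask = F.interpolate(mask, size=images.shape[-2:], mode="nearest")
        # As in MAE, the loss is computed only on masked pixels.
        loss = (F.mse_loss(recon, images, reduction="none") * pixel_mask).sum() \
               / pixel_mask.sum().clamp_min(1.0)
        return loss, recon, mask


if __name__ == "__main__":
    # Toy convolutional backbone standing in for the real encoder.
    backbone = nn.Sequential(
        nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(inplace=True),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    )
    model = FaceMAESketch(backbone, feat_channels=128, mask_ratio=0.5)
    loss, _, _ = model(torch.randn(2, 3, 112, 112))
    loss.backward()
    print(float(loss))
```

Because the hard top-k selection is wrapped in a straight-through estimator, gradients from the masked-pixel reconstruction loss reach the scoring head, which is one plausible way the "fully trainable" semantic masking described in the abstract could be realized.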

URL

https://arxiv.org/abs/2303.07648

PDF

https://arxiv.org/pdf/2303.07648.pdf
