POSTER V2: A simpler and stronger facial expression recognition network

2023-01-28 10:23:44
Jiawei Mao, Rui Xu, Xuesong Yin, Yuanqi Chang, Binling Nie, Aibin Huang

Abstract

Facial expression recognition (FER) plays an important role in a variety of real-world applications such as human-computer interaction. POSTER V1 achieves state-of-the-art (SOTA) performance in FER by effectively combining facial landmark and image features through a two-stream pyramid cross-fusion design. However, the architecture of POSTER V1 is undeniably complex and incurs expensive computational costs. To relieve the computational pressure of POSTER V1, in this paper we propose POSTER V2. It improves on POSTER V1 in three directions: cross-fusion, the two-stream design, and multi-scale feature extraction. In cross-fusion, we replace the vanilla cross-attention mechanism with a window-based cross-attention mechanism. In the two-stream design, we remove the image-to-landmark branch. For multi-scale feature extraction, POSTER V2 combines multi-scale image and landmark features, replacing POSTER V1's pyramid design. Extensive experiments on several standard datasets show that POSTER V2 achieves SOTA FER performance at minimal computational cost: it reaches 92.21% on RAF-DB, 67.49% on AffectNet (7 cls), and 63.77% on AffectNet (8 cls) using only 8.4G floating-point operations (FLOPs) and 43.7M parameters (Param), demonstrating the effectiveness of our improvements. The code and models are available at this https URL.
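
To make the cross-fusion change concrete, below is a minimal PyTorch sketch of window-based cross-attention. It is not the authors' implementation: the class name, window size, feature shapes, and the choice of taking queries from the landmark stream and keys/values from the image stream are illustrative assumptions.

```python
# Hypothetical sketch of window-based cross-attention (NOT the POSTER V2 code).
# Assumptions: square feature maps divisible by the window size; queries come
# from landmark features, keys/values from image features.
import torch
import torch.nn as nn

class WindowCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8, window_size=7):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def partition(self, x):
        # (B, H, W, C) -> (B * num_windows, window_size**2, C)
        B, H, W, C = x.shape
        s = self.window_size
        x = x.view(B, H // s, s, W // s, s, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, s * s, C)

    def forward(self, landmark_feat, image_feat):
        # Both inputs: (B, H, W, C) feature maps of equal spatial size.
        q = self.partition(landmark_feat)   # queries from the landmark stream
        kv = self.partition(image_feat)     # keys/values from the image stream
        out, _ = self.attn(q, kv, kv)       # attention restricted to each window
        return out                          # (B * num_windows, s*s, C)

# Toy usage: attention cost scales with the window area, not the full H*W.
lm = torch.randn(2, 14, 14, 256)
im = torch.randn(2, 14, 14, 256)
print(WindowCrossAttention(256)(lm, im).shape)  # torch.Size([8, 49, 256])
```

Restricting cross-attention to local windows makes its cost quadratic in the window area rather than in the full H×W sequence length, which is consistent with the FLOPs reduction the abstract reports; the exact savings depend on the authors' configuration.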

URL

https://arxiv.org/abs/2301.12149

PDF

https://arxiv.org/pdf/2301.12149.pdf