
HTNet for micro-expression recognition

2023-07-27 06:04:20
Zhifeng Wang, Kaihao Zhang, Wenhan Luo, Ramesh Sankaranarayana


Facial expressions arise from facial muscle contractions, and different muscle movements correspond to different emotional states. In micro-expression recognition, the muscle movements are usually subtle, which degrades the performance of current facial emotion recognition algorithms. Most existing methods use self-attention mechanisms to capture relationships between tokens in a sequence, but they do not take into account the inherent spatial relationships between facial landmarks. This can result in sub-optimal performance on micro-expression recognition tasks. Therefore, learning to recognize facial muscle movements is a key challenge in the area of micro-expression recognition. In this paper, we propose a Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement. HTNet includes two major components: a transformer layer that leverages local temporal features and an aggregation layer that extracts local and global semantic facial features. Specifically, HTNet divides the face into four facial areas: left lip area, left eye area, right eye area and right lip area. The transformer layer focuses on representing minor local muscle movements with local self-attention in each area. The aggregation layer learns the interactions between the eye areas and the lip areas. Experiments on four publicly available micro-expression datasets show that the proposed approach outperforms previous methods by a large margin. The code and models are available at: \url{this https URL}
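The two-stage idea described in the abstract (local self-attention inside each facial area, followed by an aggregation step across areas) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Everything specific here is an assumption made for illustration: the four areas are approximated by the four quadrants of a feature map (with left and right taken in image coordinates), attention is single-head with random weights, and the aggregation step is modeled as mean pooling per area followed by attention across the four pooled tokens. The function names `split_into_areas` and `hierarchical_forward` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (n_tokens, dim)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v


def split_into_areas(feature_map):
    """Split an (H, W, C) map into four quadrants.

    Assumption: quadrants stand in for the eye and lip areas; the paper's
    actual cropping rule is not given in the abstract.
    """
    h, w, _ = feature_map.shape
    top, bottom = feature_map[: h // 2], feature_map[h // 2:]
    return {
        "left_eye": top[:, : w // 2],
        "right_eye": top[:, w // 2:],
        "left_lip": bottom[:, : w // 2],
        "right_lip": bottom[:, w // 2:],
    }


def hierarchical_forward(feature_map, local_w, global_w):
    """Local attention per area, then attention across the pooled areas."""
    dim = feature_map.shape[-1]
    area_tokens = []
    for area in split_into_areas(feature_map).values():
        tokens = area.reshape(-1, dim)            # flatten area to tokens
        attended = self_attention(tokens, *local_w)
        area_tokens.append(attended.mean(axis=0))  # one token per area
    area_tokens = np.stack(area_tokens)            # shape (4, dim)
    mixed = self_attention(area_tokens, *global_w)  # eye/lip interactions
    return mixed.mean(axis=0)                       # face-level embedding


rng = np.random.default_rng(0)
dim = 8
feature_map = rng.normal(size=(6, 6, dim))  # toy input, not real face data
local_w = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]
global_w = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]
embedding = hierarchical_forward(feature_map, local_w, global_w)
print(embedding.shape)  # (8,)
```

In this sketch, restricting the first attention pass to tokens from one area keeps attention local, and only the second pass lets eye and lip areas exchange information. A trained model would learn the projection weights and feed the final embedding to a classifier.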



