MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution

2025-06-17 13:35:06
Zhiwen Shao, Yifan Cheng, Feiran Li, Yong Zhou, Xuequan Lu, Yuan Xie, Lizhuang Ma

Abstract

Facial micro-expression recognition (MER) is challenging because micro-expression (ME) actions are transient and subtle. Most existing methods depend on hand-crafted features, on key frames such as the onset, apex, and offset frames, or on deep networks limited by small-scale, low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework that combines the advantages of transformer, graph convolution, and vanilla convolution. In particular, we propose a novel F5C block, composed of fully-connected convolution and channel correspondence convolution, to extract local-global features directly from a sequence of raw frames without prior knowledge of key frames. The transformer-style fully-connected convolution extracts local features while maintaining global receptive fields, and the graph-style channel correspondence convolution models the correlations among feature patterns. Moreover, MER, optical flow estimation, and facial landmark detection are jointly trained by sharing the local-global features. The latter two tasks help capture subtle facial action information for MER, which can alleviate the impact of insufficient training data. Extensive experiments demonstrate that our framework (i) outperforms state-of-the-art MER methods on the CASME II, SAMM, and SMIC benchmarks, (ii) works well for optical flow estimation and facial landmark detection, and (iii) can capture subtle facial muscle actions in local regions associated with MEs. The code is available at this https URL.
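
The abstract states that MER, optical flow estimation, and facial landmark detection are trained jointly on shared features, but it does not give the loss formulation. The following is only a minimal sketch of how such a joint objective is commonly written, as a weighted sum of per-task losses. The function names, the choice of cross-entropy and mean squared error, and the weights `w_flow` and `w_lmk` are all assumptions made for illustration, not the authors' method.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (assumed loss for the MER head)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def mean_squared_error(pred, target):
    """Mean squared error (assumed loss for the two auxiliary heads)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(mer_logits, mer_label, flow_pred, flow_gt, lmk_pred, lmk_gt,
               w_flow=1.0, w_lmk=1.0):
    """Weighted sum of the three task losses.

    w_flow and w_lmk are hypothetical trade-off weights; the abstract
    does not specify how the tasks are balanced.
    """
    l_mer = cross_entropy(mer_logits, mer_label)
    l_flow = mean_squared_error(flow_pred, flow_gt)
    l_lmk = mean_squared_error(lmk_pred, lmk_gt)
    return l_mer + w_flow * l_flow + w_lmk * l_lmk

# Toy usage with flattened predictions from three hypothetical heads.
loss = joint_loss(
    mer_logits=[2.0, 0.5, -1.0], mer_label=0,
    flow_pred=[0.1, -0.2], flow_gt=[0.0, 0.0],
    lmk_pred=[10.0, 20.0], lmk_gt=[10.5, 19.5],
)
print(loss)
```

In a setup like this, gradients from the two auxiliary terms flow back into the shared feature extractor, which is the mechanism by which auxiliary tasks can compensate for limited MER training data.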

URL

https://arxiv.org/abs/2506.14511

PDF

https://arxiv.org/pdf/2506.14511.pdf

