Paper Reading AI Learner

Benchmarking Micro-action Recognition: Dataset, Methods, and Applications

2024-03-08 11:48:44
Dan Guo, Kun Li, Bin Hu, Yan Zhang, Meng Wang

Abstract

A micro-action is an imperceptible non-verbal behavior characterized by low-intensity movement. It offers insights into the feelings and intentions of individuals and is important for human-oriented applications such as emotion recognition and psychological assessment. However, identifying, differentiating, and understanding micro-actions is challenging because these subtle human behaviors are hard to perceive and access in everyday life. In this study, we collect a new micro-action dataset designated Micro-action-52 (MA-52) and propose a benchmark named the micro-action network (MANet) for the micro-action recognition (MAR) task. Uniquely, MA-52 provides a whole-body perspective, including gestures and upper- and lower-limb movements, aiming to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body-part labels and encompasses a full array of realistic and natural micro-actions, covering 205 participants and 22,422 video instances collected from psychological interviews. Based on the proposed dataset, we assess MANet and nine other prevalent action recognition methods. MANet incorporates squeeze-and-excitation (SE) and the temporal shift module (TSM) into the ResNet architecture to model the spatiotemporal characteristics of micro-actions. A joint-embedding loss is then designed for semantic matching between videos and action labels; this loss helps distinguish visually similar yet distinct micro-action categories. An extended application to emotion recognition demonstrates one of the important uses of the proposed dataset and method. In the future, we will explore human behavior, emotion, and psychological assessment in greater depth. The dataset and source code are released at this https URL.
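To make the architectural description concrete, below is a minimal PyTorch sketch of a ResNet-style block augmented with a temporal shift module (TSM) and squeeze-and-excitation (SE), in the spirit of what the abstract describes for MANet. This is not the authors' released code: the tensor layout, shift fraction, and SE reduction ratio are illustrative assumptions.

```python
# Hedged sketch: TSM + SE inside a basic ResNet block, as the abstract outlines.
import torch
import torch.nn as nn


def temporal_shift(x, num_segments, shift_div=8):
    """Shift a fraction of channels forward/backward along the time axis.

    x: (batch * num_segments, channels, height, width)
    """
    nt, c, h, w = x.shape
    n = nt // num_segments
    x = x.view(n, num_segments, c, h, w)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift one group of channels left in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift another group right in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave remaining channels unchanged
    return out.view(nt, c, h, w)


class SEBlock(nn.Module):
    """Squeeze-and-excitation: channel-wise re-weighting via a bottleneck MLP."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        scale = self.fc(x.mean(dim=(2, 3)))           # squeeze: global average pooling
        return x * scale.view(x.size(0), -1, 1, 1)    # excite: per-channel scaling


class ShiftSEResidualBlock(nn.Module):
    """Basic residual block with TSM before the first conv and SE after the second."""

    def __init__(self, channels, num_segments=8):
        super().__init__()
        self.num_segments = num_segments
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = temporal_shift(x, self.num_segments)
        out = self.relu(self.bn1(self.conv1(out)))
        out = self.se(self.bn2(self.conv2(out)))
        return self.relu(out + identity)
```

The shift exchanges information between adjacent frames at zero extra parameter cost, while SE re-weights channels, which is one plausible way to emphasize the low-intensity cues that characterize micro-actions.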

Abstract (translated)

A micro-action is an imperceptible non-verbal behavior characterized by low-intensity movement. It reveals individuals' emotions and intentions and is important for human-centered applications such as emotion recognition and psychological assessment. However, because these subtle behaviors are hard to perceive and access in everyday life, identifying, differentiating, and understanding micro-actions is challenging. In this study, we collect a new micro-action dataset named MA-52 and propose a benchmark called the micro-action network (MANet) for the micro-action recognition (MAR) task. Unlike prior work, MA-52 provides a whole-body perspective, including gestures and upper- and lower-limb movements, attempting to reveal comprehensive micro-action cues. Specifically, MA-52 contains 52 micro-action categories and seven body-part labels, covering 205 participants and 22,422 video instances collected from psychological interviews. Based on the proposed dataset, we evaluate MANet and nine other prevalent action recognition methods. MANet incorporates squeeze-and-excitation (SE) and the temporal shift module (TSM) into the ResNet architecture to model the spatiotemporal characteristics of micro-actions. A joint-embedding loss is then designed for semantic matching between videos and action labels; this loss is used to better distinguish visually similar yet distinct micro-action categories. An extended application to emotion recognition demonstrates one key value of the proposed dataset and method. Future work will explore human behavior, emotion, and psychological assessment in greater depth. The dataset and source code are released at this https URL.
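The joint-embedding loss mentioned in the abstract matches video features against text embeddings of the action labels. Below is a hedged sketch of one common way to realize such video-label semantic matching, a symmetric contrastive (InfoNCE-style) objective; the paper's exact formulation, projection heads, and temperature may differ.

```python
# Hedged sketch of a video/label joint-embedding loss; the symmetric InfoNCE
# form and the temperature value are assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F


def joint_embedding_loss(video_emb, label_emb, temperature=0.07):
    """Symmetric contrastive loss between video features and the text embeddings
    of their ground-truth action labels; both inputs have shape (batch, dim)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(label_emb, dim=-1)
    logits = v @ t.t() / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # match each video to its own label
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```

Pulling each video toward its own label embedding while pushing it away from the other labels in the batch is one way such a loss could help separate visually similar yet semantically distinct micro-action categories.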

URL

https://arxiv.org/abs/2403.05234

PDF

https://arxiv.org/pdf/2403.05234.pdf

