Paper Reading AI Learner

In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

2024-04-14 17:33:33
Wiktor Mucha, Martin Kampel

Abstract

Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, research into 2D hand pose for egocentric action recognition remains insufficient, despite the availability of user-friendly smart glasses on the market capable of capturing a single RGB image. Our study aims to fill this gap by exploring 2D hand pose estimation for egocentric action recognition, making two contributions. First, we introduce two novel approaches for 2D hand pose estimation: EffHandNet for single-hand estimation and EffHandEgoNet, tailored to the egocentric perspective and capturing interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Second, we present a robust action recognition architecture built from 2D hand and object poses, incorporating EffHandEgoNet and a transformer-based action recognition module. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves accuracies of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that 2D skeletal data is a robust input for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and how each input affects overall performance.
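The second contribution feeds EffHandEgoNet's per-frame 2D hand and object poses into a transformer-based recognizer. The PyTorch sketch below is a minimal, hypothetical illustration of that stage, not the authors' implementation: it assumes each frame is flattened into 21 (x, y) keypoints per hand plus a small object descriptor of assumed size obj_dim, and classifies a clip from a learned [CLS] token. All dimensions, the object encoding, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch (not the authors' code): transformer-based action recognition
# over per-frame 2D hand keypoints and an object descriptor.
import torch
import torch.nn as nn


class Pose2DActionTransformer(nn.Module):
    """Transformer encoder over per-frame 2D pose tokens, classified via a [CLS] token."""

    def __init__(self, num_actions, max_len=64, d_model=128, n_heads=4,
                 n_layers=2, obj_dim=5):
        super().__init__()
        # Per frame: two hands x 21 keypoints x (x, y) plus an assumed object descriptor.
        frame_dim = 2 * 21 * 2 + obj_dim
        self.embed = nn.Linear(frame_dim, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_actions)

    def forward(self, frames):
        # frames: (batch, time, frame_dim) of 2D keypoints and object features.
        x = self.embed(frames)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)
        x = x + self.pos_embed[:, : x.size(1)]
        x = self.encoder(x)
        return self.head(x[:, 0])  # action logits from the [CLS] position


# Hypothetical usage: 4 clips of 20 frames each, 36 action classes (the number defined by H2O).
model = Pose2DActionTransformer(num_actions=36)
logits = model(torch.randn(4, 20, 2 * 21 * 2 + 5))  # -> (4, 36)
```

A per-frame linear embedding keeps the input lightweight, which is consistent with the paper's emphasis on 2D skeletal data and fast inference; the actual tokenization and fusion of hand and object poses may differ in the published method.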

URL

https://arxiv.org/abs/2404.09308

PDF

https://arxiv.org/pdf/2404.09308.pdf
