Paper Reading AI Learner

In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

2024-04-14 17:33:33
Wiktor Mucha, Martin Kampel

Abstract

Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, research into 2D hand pose for egocentric action recognition remains insufficient, despite the availability of user-friendly smart glasses on the market capable of capturing a single RGB image. Our study aims to fill this gap by exploring 2D hand pose estimation for egocentric action recognition, making two contributions. First, we introduce two novel approaches for 2D hand pose estimation: EffHandNet for single-hand estimation and EffHandEgoNet, tailored to the egocentric perspective and capturing interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Second, we present a robust action recognition architecture built from 2D hand and object poses, incorporating EffHandEgoNet and a transformer-based action recognition module. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves accuracies of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that 2D skeletal data is a robust input for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and how each input affects overall performance.
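The second contribution feeds EffHandEgoNet's per-frame 2D hand and object poses into a transformer-based recognizer. The PyTorch sketch below is a minimal, hypothetical illustration of that stage, not the authors' implementation: it assumes each frame is flattened into 21 (x, y) keypoints per hand plus a small object descriptor of assumed size obj_dim, and classifies a clip from a learned [CLS] token. All dimensions, the object encoding, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch (not the authors' code): transformer-based action recognition
# over per-frame 2D hand keypoints and an object descriptor.
import torch
import torch.nn as nn


class Pose2DActionTransformer(nn.Module):
    """Transformer encoder over per-frame 2D pose tokens, classified via a [CLS] token."""

    def __init__(self, num_actions, max_len=64, d_model=128, n_heads=4,
                 n_layers=2, obj_dim=5):
        super().__init__()
        # Per frame: two hands x 21 keypoints x (x, y) plus an assumed object descriptor.
        frame_dim = 2 * 21 * 2 + obj_dim
        self.embed = nn.Linear(frame_dim, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_actions)

    def forward(self, frames):
        # frames: (batch, time, frame_dim) of 2D keypoints and object features.
        x = self.embed(frames)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)
        x = x + self.pos_embed[:, : x.size(1)]
        x = self.encoder(x)
        return self.head(x[:, 0])  # action logits from the [CLS] position


# Hypothetical usage: 4 clips of 20 frames each, 36 action classes (the number defined by H2O).
model = Pose2DActionTransformer(num_actions=36)
logits = model(torch.randn(4, 20, 2 * 21 * 2 + 5))  # -> (4, 36)
```

A per-frame linear embedding keeps the input lightweight, which is consistent with the paper's emphasis on 2D skeletal data and fast inference; the actual tokenization and fusion of hand and object poses may differ in the published method.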

URL

https://arxiv.org/abs/2404.09308

PDF

https://arxiv.org/pdf/2404.09308.pdf
