Abstract
Natural interaction with virtual objects in AR/VR environments makes for a smooth user experience. Gestures are a natural extension from the real world to augmented space for achieving these interactions. Finding discriminating spatio-temporal features relevant to gestures and hands in ego-view is the primary challenge in recognising egocentric gestures. In this work we propose a data-driven, end-to-end deep learning approach to egocentric gesture recognition, which combines an ego-hand encoder network to find ego-hand features with a recurrent neural network to discern temporally discriminating features. Since deep learning networks are data intensive, we propose a novel data augmentation technique using green screen capture to alleviate the problem of ground truth annotation. In addition, we publish a dataset of 10 gestures performed in a natural fashion in front of a green screen for training, and the same 10 gestures performed in different natural scenes without a green screen for validation. We also present the results of our network's performance in comparison to the state of the art using the AirGest dataset.
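The green screen augmentation mentioned in the abstract can be illustrated with a minimal chroma-key sketch. This is not the authors' pipeline: the green test, the `margin` threshold, and the helper names `is_green`, `composite`, and `augment_clip` are all assumptions made for illustration, and frames are plain nested lists of RGB tuples so the example has no dependencies. It shows why green screen capture can ease annotation: the same test that removes the background also yields a hand mask.

```python
import random

def is_green(pixel, margin=40):
    """Assumed rule: a pixel is green screen when G exceeds both R and B by `margin`."""
    r, g, b = pixel
    return g - max(r, b) > margin

def composite(foreground, background, margin=40):
    """Replace green-screen pixels in `foreground` with pixels from `background`.

    Returns the composited frame and a binary mask (1 = foreground pixel kept).
    """
    frame, mask = [], []
    for fg_row, bg_row in zip(foreground, background):
        keep = [0 if is_green(p, margin) else 1 for p in fg_row]
        frame.append([fg if k else bg for fg, bg, k in zip(fg_row, bg_row, keep)])
        mask.append(keep)
    return frame, mask

def augment_clip(clip, backgrounds, rng=random):
    """Composite every frame of a gesture clip onto one randomly chosen background."""
    background = rng.choice(backgrounds)
    return [composite(frame, background)[0] for frame in clip]

# Tiny 2x2 example: a green screen frame with one skin-toned "hand" pixel.
GREEN, SKIN, GREY = (20, 220, 30), (210, 160, 140), (90, 90, 90)
frame = [[GREEN, SKIN], [GREEN, GREEN]]
scene = [[GREY, GREY], [GREY, GREY]]
out, mask = composite(frame, scene)
```

In practice the frames would be image arrays and the key would be tuned to the lighting of the capture; the sketch only shows the compositing idea on which such an augmentation could rest.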
URL
https://arxiv.org/abs/1808.05380