Paper Reading AI Learner

GestARLite: An On-Device Pointing Finger Based Gestural Interface for Smartphones and Video See-Through Head-Mounts

2019-04-19 14:32:40
Varun Jain, Gaurav Garg, Ramakrishna Perla, Ramya Hebbalaguppe

Abstract

Hand gestures form an intuitive means of interaction in Mixed Reality (MR) applications. However, accurate gesture recognition can be achieved only through state-of-the-art deep learning models or with the use of expensive sensors. Despite the robustness of these deep learning models, they are generally computationally expensive, and obtaining real-time performance on-device remains a challenge. To this end, we propose a novel lightweight hand gesture recognition framework that works in first-person view for wearable devices. The models are trained on a GPU machine and ported to an Android smartphone for use with frugal wearable devices such as the Google Cardboard and VR Box. The proposed hand gesture recognition framework is driven by a cascade of state-of-the-art deep learning models: MobileNetV2 for hand localisation, our custom fingertip regression architecture, and finally a Bi-LSTM model for gesture classification. We extensively evaluate the framework on our EgoGestAR dataset. The overall framework works in real-time on mobile devices and achieves a classification accuracy of 80% on the EgoGestAR video dataset with an average latency of only 0.12 s.
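To make the cascade concrete, below is a minimal PyTorch sketch of the second and third stages: a MobileNetV2-based fingertip regressor feeding a Bi-LSTM gesture classifier. This is not the authors' released implementation; the class names, layer sizes, and `num_gestures=10` are illustrative assumptions, and the hand-localisation stage is elided, with its output standing in as pre-cropped hand frames.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class FingertipRegressor(nn.Module):
    """Illustrative stand-in for the paper's custom fingertip
    regression head: maps a cropped hand image to a normalised
    (x, y) fingertip location via MobileNetV2 features."""
    def __init__(self):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features  # 1280-channel maps
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(
            nn.Linear(1280, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 2), nn.Sigmoid(),  # (x, y) in [0, 1]
        )

    def forward(self, crops):                 # (B, 3, H, W) hand crops
        feats = self.pool(self.backbone(crops)).flatten(1)
        return self.head(feats)               # (B, 2) fingertip coords

class GestureBiLSTM(nn.Module):
    """Mirrors the Bi-LSTM stage: classifies a sequence of
    fingertip points into one of `num_gestures` classes
    (10 is an assumed class count, not from the abstract)."""
    def __init__(self, num_gestures=10, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_gestures)

    def forward(self, points):                # (B, T, 2) fingertip track
        out, _ = self.lstm(points)
        return self.fc(out[:, -1])            # logits at the final step

# Toy end-to-end pass over a 32-frame clip of 224x224 hand crops.
regressor, classifier = FingertipRegressor(), GestureBiLSTM()
clip = torch.randn(32, 3, 224, 224)           # frames from the hand detector
track = regressor(clip).unsqueeze(0)          # (1, 32, 2) fingertip trajectory
logits = classifier(track)
print(logits.shape)                           # torch.Size([1, 10])
```

In this reading, the regressor runs per frame and only the lightweight 2-D trajectory flows into the recurrent stage, which is consistent with the low on-device latency the abstract reports.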


URL

https://arxiv.org/abs/1904.09843

PDF

https://arxiv.org/pdf/1904.09843.pdf
