LocalEyenet: Deep Attention framework for Localization of Eyes

2023-03-13 06:35:45
Somsukla Maiti, Akshansh Gupta

Abstract

Developing human-machine interfaces has become a necessity for making modern machines more autonomous and efficient. Gaze-driven human intervention is an effective and convenient way to build an interface that reduces human error. Facial landmark detection is crucial for designing a robust gaze detection system. Regression-based methods provide good spatial localization of the landmarks corresponding to different parts of the face, but there is still scope for improvement, which we address by incorporating attention. In this paper, we propose LocalEyenet, a deep coarse-to-fine architecture that localizes only the eye regions and can be trained end-to-end. The model architecture, built on a stacked hourglass backbone, learns self-attention in feature maps, which helps preserve both global and local spatial dependencies in the face image. We incorporate deep layer aggregation in each hourglass to minimize the loss of attention over the depth of the architecture. Our model shows good generalization ability in cross-dataset evaluation and in real-time localization of eyes.
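To make the idea of self-attention over feature maps concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `spatial_self_attention` is hypothetical, the query, key, and value projections are assumed to be identity maps (a real model would learn them), and the residual connection is an assumption as well. It shows only how every spatial position can weight every other position, which is what lets attention capture both nearby and distant dependencies in a face image.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_self_attention(feature_map):
    """Apply scaled dot-product self-attention across spatial positions.

    feature_map: nested lists of shape H x W x C.
    Returns (output, weights): output has shape H x W x C, and weights is
    an N x N matrix with N = H * W, one row per query position.
    """
    h, w = len(feature_map), len(feature_map[0])
    c = len(feature_map[0][0])

    # Flatten the grid into N position vectors of length C.
    tokens = [feature_map[i][j] for i in range(h) for j in range(w)]
    scale = math.sqrt(c)

    # Assumption: identity query/key/value projections (no learned weights).
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))

    # Each position becomes a weighted sum over all positions.
    attended = [
        [sum(wt * v[ch] for wt, v in zip(row, tokens)) for ch in range(c)]
        for row in weights
    ]

    # Assumption: a residual connection keeps the original local features.
    out_tokens = [[x + y for x, y in zip(t, a)] for t, a in zip(tokens, attended)]
    output = [[out_tokens[i * w + j] for j in range(w)] for i in range(h)]
    return output, weights
```

Calling `spatial_self_attention` on a small `H x W x C` nested list returns the attended feature map together with the attention matrix. Each row of that matrix sums to 1, and positions with similar feature vectors receive larger weights regardless of how far apart they are on the grid.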

URL

https://arxiv.org/abs/2303.12728

PDF

https://arxiv.org/pdf/2303.12728.pdf

