Abstract
The development of human-machine interfaces has become a necessity for modern machines to achieve greater autonomy and efficiency. Gaze-driven human intervention is an effective and convenient option for creating an interface that reduces human error. Facial landmark detection is crucial for designing a robust gaze detection system. Regression-based methods provide good spatial localization of the landmarks corresponding to different parts of the face, but there is still scope for improvement, which we address by incorporating attention. In this paper, we propose LocalEyenet, a deep coarse-to-fine architecture that localizes only the eye regions and can be trained end-to-end. The model, built on a stacked hourglass backbone, learns self-attention in feature maps, which helps preserve both global and local spatial dependencies in the face image. We incorporate deep layer aggregation in each hourglass to minimize the loss of attention over the depth of the architecture. Our model shows good generalization ability in cross-dataset evaluation and in real-time localization of eyes.
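To make the idea of self-attention over a feature map concrete, the following is a minimal pure-Python sketch, not the LocalEyenet implementation. It assumes identity query, key and value projections, scaled dot-product scoring and a residual connection; the function names `softmax` and `spatial_self_attention` are hypothetical, and the abstract does not specify any of these details.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_self_attention(feature_map):
    """Scaled dot-product self-attention across the positions of a feature map.

    feature_map: H x W x C nested lists. Queries, keys and values are the raw
    feature vectors (identity projections). This is an assumption made for
    the sketch; a trained model would learn these projections.
    Returns (output, weights): output keeps the H x W x C shape, and weights
    is an N x N matrix with N = H * W.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    # Flatten the grid so that every position can attend to every other one.
    tokens = [feature_map[r][c] for r in range(height) for c in range(width)]
    scale = math.sqrt(channels)
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights.append(softmax(scores))
    # Each mixed vector is a weighted average of all feature vectors (global context).
    mixed = [
        [sum(w * tok[ch] for w, tok in zip(row, tokens)) for ch in range(channels)]
        for row in weights
    ]
    # Residual connection (assumed): keep local features and add global context.
    out_tokens = [[x + m for x, m in zip(tok, mix)] for tok, mix in zip(tokens, mixed)]
    output = [out_tokens[r * width:(r + 1) * width] for r in range(height)]
    return output, weights


if __name__ == "__main__":
    fmap = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]]
    out, attn = spatial_self_attention(fmap)
    print(len(attn), len(attn[0]))  # attention matrix is N x N with N = H * W
```

Because every position attends to every other position, the output at each location mixes in information from the whole image, while the residual term retains the local feature. This is one way to read the abstract's claim about preserving global as well as local spatial dependencies.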
URL
https://arxiv.org/abs/2303.12728