End-to-end Video Gaze Estimation via Capturing Head-face-eye Spatial-temporal Interaction Context

2023-10-27 13:23:38
Yiran Guan, Zhuoguang Chen, Wenzheng Zeng, Zhiguo Cao, Yang Xiao


In this letter, we propose Multi-Clue Gaze (MCGaze), a method for video gaze estimation that captures the spatial-temporal interaction context among the head, face, and eyes in an end-to-end manner, an aspect that has received little attention so far. The main advantage of MCGaze is that localization of the head, face, and eye clues is solved jointly with gaze estimation in a single step, with joint optimization to seek optimal performance. During this process, spatial-temporal context is exchanged among the head, face, and eye clues. Consequently, the final gaze, obtained by fusing features from the different queries, is aware of global clues from the head and face and local clues from the eyes simultaneously, which improves performance. The one-step design also keeps running efficiency high. Experiments on the challenging Gaze360 dataset verify the superiority of the proposed method. The source code will be released at this https URL.
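
To make the idea of fusing head, face, and eye query features concrete, the following is a minimal illustrative sketch, not the MCGaze implementation. It assumes each clue is already represented by a per-frame feature vector, uses a simple blend with the clip mean as a crude stand-in for temporal context exchange, and maps the concatenated features to a unit 3D gaze vector with a single linear layer. All names here (`temporal_exchange`, `fuse_clue_features`, `estimate_clip_gaze`), the feature dimension, and the random weights are hypothetical.

```python
import math
import random


def temporal_exchange(frames):
    """Blend each frame's features with the clip mean.

    Assumption: a crude stand-in for temporal context exchange,
    not the mechanism used in the paper.
    """
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return [[0.5 * f[i] + 0.5 * mean[i] for i in range(dim)] for f in frames]


def fuse_clue_features(head, face, eye, weights, bias):
    """Concatenate the three clue features and regress a unit 3D gaze vector."""
    fused = head + face + eye  # list concatenation: global and local clues together
    raw = [sum(w * x for w, x in zip(row, fused)) + b
           for row, b in zip(weights, bias)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid division by zero
    return [v / norm for v in raw]


def estimate_clip_gaze(head_seq, face_seq, eye_seq, weights, bias):
    """Return one gaze direction per frame for a short clip."""
    head_seq, face_seq, eye_seq = (
        temporal_exchange(s) for s in (head_seq, face_seq, eye_seq)
    )
    return [fuse_clue_features(h, f, e, weights, bias)
            for h, f, e in zip(head_seq, face_seq, eye_seq)]


def random_sequence(num_frames, dim, rng):
    """Hypothetical per-frame clue features, drawn at random for the demo."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]


if __name__ == "__main__":
    rng = random.Random(0)
    num_frames, dim = 3, 4  # illustrative sizes only
    head, face, eye = (random_sequence(num_frames, dim, rng) for _ in range(3))
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(3 * dim)] for _ in range(3)]
    bias = [0.0, 0.0, 0.0]
    for gaze in estimate_clip_gaze(head, face, eye, weights, bias):
        print(gaze)
```

In a learned model the weights would be trained end to end together with clue localization; here they are random, so the printed directions carry no meaning beyond showing the data flow.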




