Abstract
Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the intensive human annotation requirement prevents current methods from achieving industrial-level accuracy and robustness. Although unsupervised pre-training frameworks have succeeded in many image recognition tasks, the deep coupling between facial and eye features leaves them deficient at extracting useful gaze features from the full face. To alleviate these limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework that forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations, through collaborative feature-contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection-bottleneck design; this encourages the model to pay more attention to gaze direction rather than facial textures alone, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related information contrastive loss further boosts the learned representation by forcing the model to focus on eye-centered regions. Extensive experiments on several gaze benchmarks demonstrate that the proposed scheme outperforms the unsupervised state of the art.
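The two core ingredients mentioned above, eye-region masking for the eye-attended/unattended scheme and a contrastive loss over eye-centered embeddings, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`mask_eye_region`, `info_nce`), the `(top, left, h, w)` eye-box layout, and the InfoNCE form of the contrastive loss are assumptions made for illustration.

```python
import numpy as np

def mask_eye_region(face, eye_box, masked=True):
    """Return a copy of `face` with the eye region zeroed out.

    `masked=True` gives the eye-unattended view; `masked=False` keeps the
    eyes (eye-attended view). `eye_box` is a hypothetical (top, left, h, w)
    box; the paper's actual masking strategy may differ.
    """
    out = face.copy()
    if masked:
        t, l, h, w = eye_box
        out[t:t + h, l:l + w] = 0.0
    return out

def info_nce(anchors, positives, temperature=0.1):
    """Standard InfoNCE-style contrastive loss (an assumed stand-in for the
    paper's eye/gaze-related contrastive loss).

    Rows of `anchors` and `positives` are paired embeddings; other rows in
    the batch serve as negatives.
    """
    # L2-normalize embeddings so similarities are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # Log-softmax over each row, done stably.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for anchor i is positives[i], i.e. the diagonal.
    return -np.mean(np.diag(log_probs))
```

Paired embeddings that agree (e.g. from the eye-attended and eye-unattended views of the same face) yield a lower loss than mismatched pairs, which is the pressure that pushes gaze-related information into the shared embedding.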
URL
https://arxiv.org/abs/2407.00315