Paper Reading AI Learner

Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck

2024-06-29 04:35:08
Yangzhou Jiang, Yinxin Lin, Yaoming Wang, Teng Li, Bilian Ke, Bingbing Ni

Abstract

Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the intensive human annotation requirement inhibits current methods from achieving industrial-level accuracy and robustness. Although current unsupervised pre-training frameworks have achieved success in many image recognition tasks, they remain deficient at extracting useful gaze features from full-face images due to the deep coupling between facial and eye features. To alleviate the above limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework, which forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations, through collaborative feature-contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection bottleneck design; this design encourages the model to pay more attention to gaze direction rather than facial textures alone, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related information contrastive loss is designed to further boost the learned representation by forcing the model to focus on eye-centered regions. Extensive experimental results on several gaze benchmarks demonstrate that the proposed scheme achieves superior performance over unsupervised state-of-the-art methods.
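The two core ingredients the abstract names — alternating eye-attended/unattended patch masking, and a contrastive loss over eye-centered embeddings — can be sketched roughly as follows. This is not the authors' code: the patch-grid size, the eye-region coordinates, the mask ratio, and both function names are illustrative assumptions, and the contrastive term is a generic InfoNCE stand-in for the paper's eye/gaze-related loss.

```python
import numpy as np

def eye_patch_mask(grid=14, eye_rows=(4, 6), eye_cols=(3, 11),
                   attend=True, mask_ratio=0.75, rng=None):
    """Boolean mask over a grid x grid patch layout (hypothetical coordinates).

    attend=True  -> mask the assumed eye-region patches, so the auto-encoder
                    must reconstruct the eyes from the surrounding face;
    attend=False -> mask random non-eye patches instead, leaving the eyes
                    visible for the alternate phase of training.
    """
    rng = np.random.default_rng(rng)
    eye = np.zeros((grid, grid), dtype=bool)
    eye[eye_rows[0]:eye_rows[1], eye_cols[0]:eye_cols[1]] = True
    mask = np.zeros((grid, grid), dtype=bool)
    if attend:
        mask |= eye                                   # hide exactly the eye patches
    else:
        candidates = np.flatnonzero(~eye.ravel())     # indices of non-eye patches
        k = int(mask_ratio * candidates.size)
        chosen = rng.choice(candidates, size=k, replace=False)
        mask.ravel()[chosen] = True                   # hide a random non-eye subset
    return mask

def info_nce(anchors, positives, temperature=0.1):
    """Toy InfoNCE loss: row i of `positives` is the positive for row i of
    `anchors`; all other rows serve as negatives. Rows are L2-normalised."""
    logits = anchors @ positives.T / temperature
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

In a full training loop, one would alternate `attend=True` and `attend=False` masks across iterations and apply the contrastive term between eye-crop and full-face embeddings; the details of that pairing are specific to the paper and not reproduced here.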


URL

https://arxiv.org/abs/2407.00315

PDF

https://arxiv.org/pdf/2407.00315.pdf

