Paper Reading AI Learner

Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition

2023-09-08 09:45:19
Lingyu Du, Xucong Zhang, Guohao Lan

Abstract

Appearance-based gaze estimation has shown great promise in many applications by using a single general-purpose camera as the input device. However, its success is highly dependent on the availability of large-scale, well-annotated gaze datasets, which are sparse and expensive to collect. To alleviate this challenge, we propose ConGaze, a contrastive-learning-based framework that leverages unlabeled facial images to learn generic gaze-aware representations across subjects in an unsupervised way. Specifically, we introduce gaze-specific data augmentation to preserve gaze-semantic features and maintain gaze consistency, both of which prove crucial for effective contrastive gaze representation learning. Moreover, we devise a novel subject-conditional projection module that encourages a shared feature extractor to learn gaze-aware and generic representations. Our experiments on three public gaze estimation datasets show that ConGaze outperforms existing unsupervised learning solutions by 6.7% to 22.5%, and achieves a 15.1% to 24.6% improvement over its supervised learning-based counterpart in cross-dataset evaluations.
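The abstract itself contains no code, but the core idea it describes — a shared feature extractor whose outputs pass through a subject-conditional projection before a contrastive loss over two augmented views — can be illustrated with a minimal sketch. This is not the authors' implementation: the per-subject linear heads, the dimensions, and the use of a standard SimCLR-style NT-Xent loss are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical subject-conditional projection: features from a shared encoder
# are routed through a small per-subject linear head before the contrastive loss.
FEAT_DIM, PROJ_DIM, NUM_SUBJECTS = 8, 4, 3
heads = {s: rng.normal(0.0, 0.3, (FEAT_DIM, PROJ_DIM)) for s in range(NUM_SUBJECTS)}

def project(feats, subject_ids):
    """Apply each sample's subject-specific head to its shared-encoder feature."""
    return np.stack([f @ heads[s] for f, s in zip(feats, subject_ids)])

def nt_xent(z1, z2, temperature=0.5):
    """Standard NT-Xent contrastive loss over two augmented views (SimCLR-style)."""
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm -> cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # mask self-similarity
    n = len(z1)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive pairs
    m = sim.max(axis=1, keepdims=True)                 # stable log-softmax per row
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    return -log_prob[np.arange(2 * n), targets].mean()

# Two augmented "views" of the same batch of encoder features; the small noise
# stands in for the gaze-specific augmentations described in the abstract.
feats = rng.normal(size=(4, FEAT_DIM))
subject_ids = [0, 1, 0, 2]
view1 = project(feats + 0.05 * rng.normal(size=feats.shape), subject_ids)
view2 = project(feats + 0.05 * rng.normal(size=feats.shape), subject_ids)
loss = nt_xent(view1, view2)
```

The design point the sketch captures is the separation of concerns: subject-specific appearance variation is absorbed by the per-subject heads, so the contrastive objective pushes the shared encoder toward representations that generalize across subjects.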


URL

https://arxiv.org/abs/2309.04506

PDF

https://arxiv.org/pdf/2309.04506.pdf

