Paper Reading AI Learner

Investigation of Architectures and Receptive Fields for Appearance-based Gaze Estimation

2023-08-18 14:41:51
Yunhan Wang, Xiangwei Shi, Shalini De Mello, Hyung Jin Chang, Xucong Zhang

Abstract

With the rapid development of deep learning technology in the past decade, appearance-based gaze estimation has attracted great attention from both the computer vision and human-computer interaction research communities. A variety of methods have been proposed with different mechanisms, including soft attention, hard attention, two-eye asymmetry, feature disentanglement, rotation consistency, and contrastive learning. Most of these methods take a single face or multiple regions as input, yet the basic architecture for gaze estimation has not been fully explored. In this paper, we reveal that tuning a few simple parameters of a ResNet architecture can outperform most of the existing state-of-the-art methods on the gaze estimation task across three popular datasets. With our extensive experiments, we conclude that the stride number, input image resolution, and multi-region architecture are critical to gaze estimation performance, while their effectiveness depends on the quality of the input face image. Taking ResNet-50 as the backbone, we obtain state-of-the-art gaze estimation errors of 3.64 degrees on ETH-XGaze, 4.50 degrees on MPIIFaceGaze, and 9.13 degrees on Gaze360.

URL

https://arxiv.org/abs/2308.09593

PDF

https://arxiv.org/pdf/2308.09593.pdf
