Paper Reading AI Learner

EFE: End-to-end Frame-to-Gaze Estimation

2023-05-09 15:25:45
Haldun Balim, Seonwook Park, Xi Wang, Xucong Zhang, Otmar Hilliges

Abstract

Despite the recent development of learning-based gaze estimation methods, most require one or more eye or face region crops as input and produce a gaze direction vector as output. Cropping yields higher resolution in the eye regions and removes confounding factors (such as clothing and hair), which is believed to benefit final model performance. However, this eye/face patch cropping process is expensive, error-prone, and implemented differently across methods. In this paper, we propose a frame-to-gaze network that directly predicts both the 3D gaze origin and the 3D gaze direction from the raw camera frame, without any face or eye cropping. Our method demonstrates that direct gaze regression from the raw frame, downscaled from FHD/HD to VGA/HVGA resolution, is possible despite the challenge of having very few pixels in the eye region. The proposed method achieves results comparable to state-of-the-art methods in Point-of-Gaze (PoG) estimation on three public gaze datasets (GazeCapture, MPIIFaceGaze, and EVE) and generalizes well to extreme camera view changes.
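Given the network's two outputs, a 3D gaze origin and a 3D gaze direction, the Point-of-Gaze can be recovered by intersecting the gaze ray with the screen plane. The sketch below is not the authors' code; it assumes a calibrated screen plane given by a point on the plane and its normal (e.g. the plane z = 0 with normal (0, 0, 1) in screen coordinates):

```python
import numpy as np

def point_of_gaze(origin, direction, plane_point, plane_normal):
    """Intersect the gaze ray with the screen plane.

    origin, direction: predicted 3D gaze origin and direction (camera/screen coords).
    plane_point, plane_normal: an assumed screen-plane calibration, not from the paper.
    Returns the 3D Point-of-Gaze on the plane.
    """
    o = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    p = np.asarray(plane_point, dtype=float)
    n = np.asarray(plane_normal, dtype=float)

    denom = n @ d
    if abs(denom) < 1e-9:
        raise ValueError("gaze ray is (nearly) parallel to the screen plane")

    # Solve n . (o + t*d - p) = 0 for the ray parameter t.
    t = n @ (p - o) / denom
    return o + t * d

# Example: an origin 0.6 m in front of the screen, gazing at the screen center.
pog = point_of_gaze([0.1, 0.0, 0.6], [-0.1, 0.0, -0.6], [0, 0, 0], [0, 0, 1])
print(pog)  # [0. 0. 0.]
```

The 2D on-screen PoG then follows by expressing this 3D point in the screen's pixel coordinate frame, which depends on the specific device calibration.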


URL

https://arxiv.org/abs/2305.05526

PDF

https://arxiv.org/pdf/2305.05526.pdf

