DVGaze: Dual-View Gaze Estimation

2023-08-20 16:14:22
Yihua Cheng, Feng Lu

Abstract

Gaze estimation methods estimate gaze from facial appearance captured with a single camera. However, due to the limited view of a single camera, the captured facial appearance cannot provide complete facial information, which complicates the gaze estimation problem. Camera devices have advanced rapidly in recent years; dual cameras are now affordable for users and have been integrated into many devices. This development suggests that we can further improve gaze estimation performance with dual-view gaze estimation. In this paper, we propose a dual-view gaze estimation network (DV-Gaze), which estimates dual-view gaze directions from a pair of images. We first propose a dual-view interactive convolution (DIC) block for DV-Gaze. DIC blocks exchange dual-view information during convolution at multiple feature scales: each block fuses dual-view features along epipolar lines and compensates the original features with the fused features. We further propose a dual-view transformer to estimate gaze from the dual-view features, where camera poses are encoded to indicate position information in the transformer. We also consider the geometric relation between dual-view gaze directions and propose a dual-view gaze consistency loss for DV-Gaze. DV-Gaze achieves state-of-the-art performance on the ETH-XGaze and EVE datasets, and our experiments also demonstrate the potential of dual-view gaze estimation. We release our code at this https URL.
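
To make the described epipolar fusion concrete, below is a minimal PyTorch sketch of fusing dual-view features along epipolar lines and compensating the original features with the fused ones. It is an illustrative simplification, not the paper's DIC block: the module name EpipolarFusion, the uniform sampling of candidate points along each line, the mean-pooling aggregation, and the 1x1 projection are all assumptions, and lines are parametrized by x, so near-vertical epipolar lines are handled only crudely.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EpipolarFusion(nn.Module):
    """Illustrative cross-view fusion: for each location in view 1's feature
    map, pool view 2's features along the corresponding epipolar line and add
    a projected residual back to view 1's features."""

    def __init__(self, channels: int, num_samples: int = 16):
        super().__init__()
        self.num_samples = num_samples
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f1, f2, fund):
        # f1, f2: (B, C, H, W) feature maps; fund: (3, 3) fundamental matrix
        # mapping view-1 pixels to epipolar lines in view 2 (assumed given).
        B, C, H, W = f1.shape
        ys, xs = torch.meshgrid(
            torch.arange(H, dtype=f1.dtype, device=f1.device),
            torch.arange(W, dtype=f1.dtype, device=f1.device),
            indexing="ij")
        pts = torch.stack([xs.flatten(), ys.flatten(),
                           torch.ones_like(xs.flatten())], dim=0)  # (3, HW)
        a, b, c = fund @ pts  # epipolar lines a*x + b*y + c = 0 in view 2
        # Sample num_samples candidate points along each line, indexed by x.
        xk = torch.linspace(0, W - 1, self.num_samples,
                            dtype=f1.dtype, device=f1.device)  # (K,)
        b_safe = torch.where(b.abs() < 1e-6, torch.full_like(b, 1e-6), b)
        yk = -(a.unsqueeze(1) * xk + c.unsqueeze(1)) / b_safe.unsqueeze(1)  # (HW, K)
        # Normalize to grid_sample's [-1, 1] range; off-image samples read zero.
        gx = (2 * xk / (W - 1) - 1).expand(H * W, -1)
        gy = 2 * yk / (H - 1) - 1
        grid = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
        sampled = F.grid_sample(f2, grid, align_corners=True)  # (B, C, HW, K)
        pooled = sampled.mean(dim=-1).view(B, C, H, W)  # aggregate along lines
        return f1 + self.proj(pooled)  # compensate view 1 with fused features

# Usage: torch.eye(3) stands in for a calibrated fundamental matrix.
fuse = EpipolarFusion(channels=64)
f1, f2 = torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14)
fused = fuse(f1, f2, torch.eye(3))

In a full model, such a block would sit at several feature scales so the two views exchange information throughout the convolutional backbone, matching the abstract's description.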

Abstract (translated)

Gaze estimation methods use a single camera to capture facial appearance and estimate the gaze direction from it. However, because a single camera has a limited view, the captured facial appearance cannot provide complete facial information, which makes the gaze estimation problem more complex. In recent years, camera devices have been updated rapidly; dual cameras are cost-effective for users and have already been integrated into many devices. This trend suggests that we can use dual-view gaze estimation to further improve gaze estimation performance. In this paper, we propose a dual-view gaze estimation network (DV-Gaze), which estimates dual-view gaze directions from a pair of images. We first propose the dual-view interactive convolution (DIC) block in DV-Gaze. DIC blocks exchange dual-view information during convolution at multiple feature scales; they fuse dual-view features along epipolar lines and compensate the original features with the fused features. We also propose a dual-view transformer to estimate gaze from dual-view features, in which camera poses are encoded as position information. We further consider the geometric relation between dual-view gaze directions and propose a dual-view gaze consistency loss for DV-Gaze. DV-Gaze achieves state-of-the-art performance on the ETH-XGaze and EVE datasets. Our experiments also demonstrate the potential of dual-view gaze estimation. We release the code at this URL.
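
The geometric relation behind the dual-view gaze consistency loss can also be sketched briefly. Assuming gaze is predicted as a 3D direction in each camera's coordinate frame and the relative rotation between the cameras is known from calibration, rotating one view's prediction into the other view's frame should reproduce the other prediction. The function names and the angular-error formulation below are illustrative assumptions, not the paper's exact definition.

import torch
import torch.nn.functional as F

def angular_error(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Mean angle (radians) between two batches of 3D direction vectors."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    cos = (a * b).sum(dim=-1).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    return torch.arccos(cos).mean()

def dual_view_consistency_loss(gaze1, gaze2, rot_12):
    """Hypothetical consistency term: gaze1, gaze2 are (B, 3) predictions in
    camera 1's and camera 2's frames; rot_12 is the (3, 3) rotation from
    camera-1 to camera-2 coordinates."""
    gaze1_in_2 = gaze1 @ rot_12.T  # express view-1 predictions in view 2's frame
    return angular_error(gaze1_in_2, gaze2)

# Usage: the identity stands in for a calibrated extrinsic rotation.
g1, g2 = torch.randn(8, 3), torch.randn(8, 3)
loss = dual_view_consistency_loss(g1, g2, torch.eye(3))

Such a term only enforces geometric agreement between the two views' predictions; it would be added on top of the supervised gaze loss rather than replacing it.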

URL

https://arxiv.org/abs/2308.10310

PDF

https://arxiv.org/pdf/2308.10310.pdf

