Abstract
Gaze estimation methods typically estimate gaze from facial appearance captured by a single camera. However, due to the limited view of a single camera, the captured facial appearance cannot provide complete facial information, which complicates the gaze estimation problem. Recently, camera hardware has advanced rapidly; dual cameras are now affordable and have been integrated into many devices. This development suggests that gaze estimation performance can be further improved with dual-view gaze estimation. In this paper, we propose a dual-view gaze estimation network (DV-Gaze), which estimates dual-view gaze directions from a pair of images. We first propose a dual-view interactive convolution (DIC) block in DV-Gaze. DIC blocks exchange dual-view information during convolution at multiple feature scales: each block fuses dual-view features along epipolar lines and compensates the original features with the fused features. We further propose a dual-view transformer that estimates gaze from dual-view features, where camera poses are encoded to provide position information. We also consider the geometric relation between dual-view gaze directions and propose a dual-view gaze consistency loss for DV-Gaze. DV-Gaze achieves state-of-the-art performance on the ETH-XGaze and EVE datasets, and our experiments demonstrate the potential of dual-view gaze estimation. We release code at this https URL.
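The abstract's dual-view gaze consistency loss rests on a simple geometric fact: a gaze direction predicted in one camera's coordinate frame, when rotated by the relative camera pose, should agree with the direction predicted in the other camera's frame. Below is a minimal sketch of such a loss. The function name, the use of mean angular error, and the rotation convention are our assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def dual_view_consistency_loss(gaze_a, gaze_b, R_ab):
    """Hypothetical sketch of a dual-view gaze consistency loss.

    gaze_a, gaze_b : (N, 3) arrays of unit gaze vectors predicted in
                     camera A's and camera B's coordinate frames.
    R_ab           : (3, 3) rotation matrix mapping directions from
                     camera A's frame into camera B's frame (assumed
                     known from the calibrated camera poses).
    Returns the mean angular disagreement (radians) between view A's
    prediction, expressed in view B's frame, and view B's prediction.
    """
    # Rotate A's gaze directions into B's camera frame.
    gaze_a_in_b = gaze_a @ R_ab.T
    # Cosine of the angle between the paired unit vectors.
    cos_sim = np.sum(gaze_a_in_b * gaze_b, axis=1)
    # Clamp for numerical safety before arccos, then average.
    return np.arccos(np.clip(cos_sim, -1.0, 1.0)).mean()
```

In training, this term would be added to the per-view gaze regression losses so that the two branches are encouraged to produce geometrically consistent predictions.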
URL
https://arxiv.org/abs/2308.10310