Abstract
Despite the recent development of learning-based gaze estimation methods, most require one or more eye or face region crops as input and produce a gaze direction vector as output. Cropping yields higher effective resolution in the eye regions, and the removal of confounding factors (such as clothing and hair) is believed to benefit final model performance. However, this eye/face patch cropping process is computationally expensive, error-prone, and implementation-specific across methods. In this paper, we propose a frame-to-gaze network that directly predicts both the 3D gaze origin and the 3D gaze direction from the raw camera frame, without any face or eye cropping. Our method demonstrates that direct gaze regression from the raw downscaled frame, from FHD/HD to VGA/HVGA resolution, is possible despite the very few pixels available in the eye region. The proposed method achieves results comparable to state-of-the-art methods in Point-of-Gaze (PoG) estimation on three public gaze datasets: GazeCapture, MPIIFaceGaze, and EVE, and generalizes well to extreme changes in camera view.
URL
https://arxiv.org/abs/2305.05526