Deep learning has bolstered gaze estimation techniques, but real-world deployment has been impeded by inadequate training datasets. This problem is exacerbated by both hardware-induced variations in eye images and inherent biological differences across the recorded participants, leading to feature- and pixel-level variance that hinders the generalizability of models trained on specific datasets. While synthetic datasets can be a solution, their creation is both time- and resource-intensive. To address this problem, we present a framework called Light Eyes, or "LEyes", which, unlike conventional photorealistic methods, models only the key image features required for video-based eye tracking using simple light distributions. LEyes facilitates easy configuration for training neural networks across diverse gaze-estimation tasks. We demonstrate that models trained using LEyes outperform other state-of-the-art algorithms in terms of pupil and CR localization across well-known datasets. In addition, a LEyes-trained model outperforms an industry-standard eye tracker while using significantly more cost-effective hardware. Going forward, we are confident that LEyes will revolutionize synthetic data generation for gaze estimation models and lead to significant improvements in the next generation of video-based eye trackers.
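To make the idea of modeling only "simple light distributions" concrete, here is a minimal sketch of the kind of synthetic training image such a framework could generate: a dark Gaussian pupil and one or two small bright Gaussian corneal reflections on a noisy background, with the ground-truth centers returned alongside the image. All function names, parameter ranges, and distributions are illustrative assumptions, not the LEyes implementation.

```python
import numpy as np

def gaussian_blob(h, w, cx, cy, sigma):
    """2D Gaussian centered at (cx, cy) on an h x w grid, peak value 1."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def synth_eye_image(h=128, w=128, rng=None):
    """Toy 'light distribution' image: flat background + dark pupil + bright CRs + noise.
    All parameter ranges are illustrative assumptions, not LEyes settings."""
    if rng is None:
        rng = np.random.default_rng()
    img = np.full((h, w), rng.uniform(0.4, 0.6))                        # background level
    px, py = rng.uniform(0.3 * w, 0.7 * w), rng.uniform(0.3 * h, 0.7 * h)
    img -= 0.4 * gaussian_blob(h, w, px, py, sigma=rng.uniform(8, 14))  # dark pupil
    cr_centers = []
    for _ in range(rng.integers(1, 3)):                                 # one or two CRs
        cx, cy = px + rng.uniform(-10, 10), py + rng.uniform(-10, 10)
        img += 0.5 * gaussian_blob(h, w, cx, cy, sigma=rng.uniform(1.0, 2.5))
        cr_centers.append((cx, cy))
    img += rng.normal(0.0, 0.02, size=(h, w))                           # sensor noise
    return np.clip(img, 0.0, 1.0), (px, py), cr_centers                 # image + ground truth

image, pupil_center, crs = synth_eye_image()
```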
https://arxiv.org/abs/2309.06129
Appearance-based gaze estimation has shown great promise in many applications by using a single general-purpose camera as the input device. However, its success is highly dependent on the availability of large-scale, well-annotated gaze datasets, which are sparse and expensive to collect. To alleviate this challenge, we propose ConGaze, a contrastive learning-based framework that leverages unlabeled facial images to learn generic gaze-aware representations across subjects in an unsupervised way. Specifically, we introduce gaze-specific data augmentation to preserve the gaze-semantic features and maintain gaze consistency, which are proven to be crucial for effective contrastive gaze representation learning. Moreover, we devise a novel subject-conditional projection module that encourages a shared feature extractor to learn gaze-aware and generic representations. Our experiments on three public gaze estimation datasets show that ConGaze outperforms existing unsupervised learning solutions by 6.7% to 22.5%, and achieves 15.1% to 24.6% improvement over its supervised learning-based counterpart in cross-dataset evaluations.
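As a concrete point of reference, the following is a generic NT-Xent-style contrastive loss over two gaze-preserving augmented views, of the kind such a framework builds on; it is a stand-in sketch, not the ConGaze objective or its subject-conditional projection module.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """Generic NT-Xent loss over embeddings of two gaze-preserving augmented views.
    z1, z2: (N, D) outputs of a shared feature extractor plus projection head."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # 2N unit-norm embeddings
    sim = z @ z.t() / temperature                               # scaled cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float('-inf'))
    # The positive for sample i is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```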
https://arxiv.org/abs/2309.04506
Although recent deep learning-based gaze estimation approaches have achieved much improvement, we still know little about how gaze features are connected to the physics of gaze. In this paper, we try to answer this question by analyzing the gaze feature manifold. Our analysis reveals that the geodesic distance between gaze features is consistent with the gaze differences between samples. Based on this finding, we construct the Physics-Consistent Feature (PCF) in an analytical way, which connects gaze features to the physical definition of gaze. We further propose the PCFGaze framework, which directly optimizes the gaze feature space under the guidance of PCF. Experimental results demonstrate that the proposed framework alleviates the overfitting problem and significantly improves cross-domain gaze estimation accuracy without extra training data. This insight into gaze features has the potential to benefit other regression tasks with physical meanings.
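For readers unfamiliar with the physical quantity involved, the sketch below computes the angular difference between two gaze directions given as pitch/yaw angles, which is the physical gaze difference the paper finds to be consistent with geodesic feature distance. The pitch/yaw-to-vector convention used here is a common one and is an assumption, not necessarily the paper's.

```python
import numpy as np

def pitchyaw_to_vector(pitch, yaw):
    """Unit 3D gaze vector from pitch/yaw in radians (a common convention; assumed here)."""
    return np.array([-np.cos(pitch) * np.sin(yaw),
                     -np.sin(pitch),
                     -np.cos(pitch) * np.cos(yaw)])

def angular_difference_deg(g1, g2):
    """Angle in degrees between two gaze directions -- the physical gaze difference that,
    per the paper's finding, should track the geodesic distance between gaze features."""
    cos = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```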
https://arxiv.org/abs/2309.02165
Gaze estimation methods estimate gaze from facial appearance with a single camera. However, due to the limited view of a single camera, the captured facial appearance cannot provide complete facial information, which complicates the gaze estimation problem. Camera devices have been updated rapidly in recent years: dual cameras are affordable for users and have been integrated into many devices. This development suggests that we can further improve gaze estimation performance with dual-view gaze estimation. In this paper, we propose a dual-view gaze estimation network (DV-Gaze), which estimates dual-view gaze directions from a pair of images. We first propose a dual-view interactive convolution (DIC) block in DV-Gaze. DIC blocks exchange dual-view information during convolution at multiple feature scales; they fuse dual-view features along epipolar lines and compensate the original features with the fused features. We further propose a dual-view transformer to estimate gaze from dual-view features, with camera poses encoded to provide position information in the transformer. We also consider the geometric relation between dual-view gaze directions and propose a dual-view gaze consistency loss for DV-Gaze. DV-Gaze achieves state-of-the-art performance on the ETH-XGaze and EVE datasets, and our experiments also demonstrate the potential of dual-view gaze estimation. We release our code at this https URL.
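The sketch below shows a greatly simplified cross-view fusion block in the spirit of the DIC block: each view's features are compensated with features fused from the other view via a residual connection. The real DIC block fuses along epipolar lines; the plain channel-wise concatenation here is an illustrative simplification, not the paper's design.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Simplified stand-in for a DIC-style block: each view's feature map is compensated
    with features fused from the other view. The real block fuses along epipolar lines;
    here we only concatenate channel-wise for illustration."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f1, f2):
        fused_1 = self.fuse(torch.cat([f1, f2], dim=1))   # information from view 2 into view 1
        fused_2 = self.fuse(torch.cat([f2, f1], dim=1))   # and vice versa
        return f1 + fused_1, f2 + fused_2                 # residual compensation
```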
https://arxiv.org/abs/2308.10310
With the rapid development of deep learning technology in the past decade, appearance-based gaze estimation has attracted great attention from both the computer vision and human-computer interaction research communities. A variety of methods have been proposed, with mechanisms including soft attention, hard attention, two-eye asymmetry, feature disentanglement, rotation consistency, and contrastive learning. Most of these methods take a single face or multiple face regions as input, yet the basic architecture of gaze estimation has not been fully explored. In this paper, we show that tuning a few simple parameters of a ResNet architecture can outperform most of the existing state-of-the-art methods for the gaze estimation task on three popular datasets. Through extensive experiments, we conclude that the stride number, input image resolution, and multi-region architecture are critical for gaze estimation performance, while their effectiveness depends on the quality of the input face image. Taking ResNet-50 as the backbone, we obtain state-of-the-art performance on three datasets, with gaze estimation errors of 3.64 degrees on ETH-XGaze, 4.50 degrees on MPIIFaceGaze, and 9.13 degrees on Gaze360.
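As an illustration of the kind of "simple parameter" tuning discussed, the sketch below builds a ResNet-50 gaze regressor in torchvision and removes the downsampling stride of the last stage to keep a larger feature map. The exact stride, resolution, and head settings used in the paper are not reproduced here; these are assumptions.

```python
import torch.nn as nn
from torchvision import models

def build_gaze_resnet50(output_dim=2, reduce_last_stride=True):
    """ResNet-50 gaze regressor; the stride tweak illustrates the kind of simple
    architectural parameter the paper studies (exact settings are assumptions)."""
    net = models.resnet50(weights=None)
    if reduce_last_stride:
        # Keep a larger feature map by removing the downsampling in the last stage.
        net.layer4[0].conv2.stride = (1, 1)
        net.layer4[0].downsample[0].stride = (1, 1)
    net.fc = nn.Linear(net.fc.in_features, output_dim)     # pitch/yaw regression head
    return net
```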
https://arxiv.org/abs/2308.09593
With the escalating demand for human-machine interfaces in intelligent systems, the development of gaze-controlled systems has become a necessity. Gaze, being a non-intrusive form of human interaction, is one of the best-suited approaches. Appearance-based deep learning models are the most widely used for gaze estimation, but their performance depends heavily on the size of the labeled gaze dataset, which in turn affects how well they generalize. This paper aims to develop a semi-supervised contrastive learning framework for estimating gaze direction. With a small labeled gaze dataset, the framework is able to find a generalized solution even for unseen face images. We propose a new contrastive loss paradigm that maximizes the similarity agreement between similar images while at the same time reducing the redundancy in embedding representations. Our contrastive regression framework performs well in comparison to several state-of-the-art contrastive learning techniques used for gaze estimation.
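A Barlow-Twins-style loss is one well-known way to combine similarity agreement with redundancy reduction in the embedding, so the sketch below is offered as a generic stand-in for the proposed contrastive regression objective, not its exact form.

```python
import torch

def similarity_redundancy_loss(z1, z2, lam=5e-3):
    """Barlow-Twins-style stand-in: pull the two views' embeddings together
    (on-diagonal term) while decorrelating embedding dimensions (off-diagonal term)
    to reduce redundancy. Not the paper's exact loss."""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.t() @ z2) / n                                        # d x d cross-correlation
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()               # agreement between views
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()   # redundancy penalty
    return on_diag + lam * off_diag
```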
https://arxiv.org/abs/2308.02784
Face rendering using neural radiance fields (NeRF) is a rapidly developing research area in computer vision. While recent methods primarily focus on controlling facial attributes such as identity and expression, they often overlook the crucial aspect of modeling eyeball rotation, which holds importance for various downstream tasks. In this paper, we aim to learn a face NeRF model that is sensitive to eye movements from multi-view images. We address two key challenges in eye-aware face NeRF learning: how to effectively capture eyeball rotation for training and how to construct a manifold for representing eyeball rotation. To accomplish this, we first fit FLAME, a well-established parametric face model, to the multi-view images considering multi-view consistency. Subsequently, we introduce a new Dynamic Eye-aware NeRF (DeNeRF). DeNeRF transforms 3D points from different views into a canonical space to learn a unified face NeRF model. We design an eye deformation field for the transformation, including rigid transformation, e.g., eyeball rotation, and non-rigid transformation. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our model is capable of generating high-fidelity images with accurate eyeball rotation and non-rigid periocular deformation, even under novel viewing angles. Furthermore, we show that utilizing the rendered images can effectively enhance gaze estimation performance.
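The sketch below illustrates, in a few lines, the combination of a rigid and a non-rigid transformation described above: points are rotated about an eyeball center (eyeball rotation) and then shifted by a small learned offset toward a canonical space. Everything here, including the shapes and the tiny offset network, is an illustrative assumption, not DeNeRF's actual deformation field.

```python
import torch
import torch.nn as nn

class EyeDeformationField(nn.Module):
    """Toy eye-aware deformation: rigid rotation about the eyeball center plus a small
    learned non-rigid (periocular) offset mapping points toward a canonical space."""
    def __init__(self):
        super().__init__()
        self.offset_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, points, eye_center, eye_rotation):
        # Rigid part: rotate (N, 3) points about the eyeball center by the inverse rotation.
        rigid = (points - eye_center) @ eye_rotation.transpose(-1, -2) + eye_center
        # Non-rigid part: small learned residual offset.
        return rigid + self.offset_net(rigid)
```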
https://arxiv.org/abs/2308.00773
Driver gaze plays an important role in gaze-based applications such as driver attentiveness detection, visual distraction detection, gaze behavior understanding, and building driver assistance systems. The main objective of this study is to provide a comprehensive summary of driver gaze fundamentals, methods to estimate driver gaze, and its applications in real-world driving scenarios. We first discuss the fundamentals of driver gaze, covering head-mounted and remote-setup gaze estimation and the terminology used for each of these data collection methods. Next, we list the existing benchmark driver gaze datasets, highlighting the collection methodology and the equipment used. This is followed by a discussion of the algorithms used for driver gaze estimation, which primarily involve traditional machine learning and deep learning techniques. The estimated driver gaze is then used to understand gaze behavior while maneuvering through intersections, on-ramps, off-ramps, and lane changes, and to determine the effect of roadside advertising structures. Finally, we discuss the limitations of the existing literature, open challenges, and the future scope of driver gaze estimation and gaze-based applications.
https://arxiv.org/abs/2307.01470
In recent years we have witnessed an increasing number of interactive systems on handheld mobile devices which utilise gaze as a single or complementary interaction modality. This trend is driven by the enhanced computational power of these devices, higher resolution and capacity of their cameras, and improved gaze estimation accuracy obtained from advanced machine learning techniques, especially in deep learning. As the literature is fast progressing, there is a pressing need to review the state of the art, delineate the boundary, and identify the key research challenges and opportunities in gaze estimation and interaction. This paper aims to serve this purpose by presenting an end-to-end holistic view in this area, from gaze capturing sensors, to gaze estimation workflows, to deep learning techniques, and to gaze interactive applications.
https://arxiv.org/abs/2307.00122
Along with the recent development of deep neural networks, appearance-based gaze estimation has succeeded considerably when training and testing within the same domain. Compared to the within-domain task, the variance across different domains makes cross-domain performance drop severely, preventing gaze estimation deployment in real-world applications. Among all the factors, the ranges of head pose and gaze are believed to play a significant role in the final performance of gaze estimation, yet collecting data with large ranges is expensive. This work proposes an effective model training pipeline consisting of training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages single-image 3D reconstruction to expand the range of head poses in the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suitable for synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce a background augmentation consistency loss to utilize the characteristics of the synthetic source domain. Through comprehensive experiments, we show that a model using only monocular-reconstructed synthetic training data can perform comparably to one trained on real data with a large label range. Our proposed domain adaptation approach further improves the performance on multiple target domains. The code and data will be available at \url{this https URL}.
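The background augmentation consistency idea can be illustrated in a few lines: composite the same synthetic face onto two random backgrounds and require the gaze predictions to agree. The tensor layout, compositing scheme, and loss choice below are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def background_consistency_loss(model, face_img, face_mask, backgrounds):
    """Illustrative consistency term: paste the same synthetic face (B, 3, H, W) with its
    mask (B, 1, H, W) onto two randomly chosen backgrounds (N, 3, H, W) and require the
    gaze predictions to agree."""
    idx = torch.randperm(backgrounds.size(0))[:2]
    view_a = face_img * face_mask + backgrounds[idx[0]] * (1 - face_mask)
    view_b = face_img * face_mask + backgrounds[idx[1]] * (1 - face_mask)
    return F.mse_loss(model(view_a), model(view_b))
```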
https://arxiv.org/abs/2305.16140
The latest developments in computer hardware, sensor technologies, and artificial intelligence can make virtual reality (VR) and virtual spaces an important part of human everyday life. Eye tracking offers not only a hands-free way of interaction but also the possibility of a deeper understanding of human visual attention and cognitive processes in VR. Despite these possibilities, eye-tracking data also reveal privacy-sensitive attributes of users when combined with information about the presented stimulus. To address these possibilities and potential privacy issues, in this survey we first cover major works in eye tracking, VR, and privacy between 2012 and 2022. The eye tracking in VR part covers the complete pipeline of eye-tracking methodology, from pupil detection and gaze estimation to offline use and analyses; for privacy and security, we focus on eye-based authentication as well as computational methods to preserve the privacy of individuals and their eye-tracking data in VR. Finally, taking all of this into consideration, we outline three main directions for the research community, focusing primarily on privacy challenges. In summary, this survey provides an extensive literature review of what is possible with eye tracking in VR and the privacy implications of those possibilities.
https://arxiv.org/abs/2305.14080
Appearance-based gaze estimation has been actively studied in recent years. However, its generalization performance for unseen head poses is still a significant limitation for existing methods. This work proposes a generalizable multi-view gaze estimation task and a cross-view feature fusion method to address this issue. In addition to paired images, our method takes the relative rotation matrix between two cameras as additional input. The proposed network learns to extract rotatable feature representation by using relative rotation as a constraint and adaptively fuses the rotatable features via stacked fusion modules. This simple yet efficient approach significantly improves generalization performance under unseen head poses without significantly increasing computational cost. The model can be trained with random combinations of cameras without fixing the positioning and can generalize to unseen camera pairs during inference. Through experiments using multiple datasets, we demonstrate the advantage of the proposed method over baseline methods, including state-of-the-art domain generalization approaches.
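One way to picture a "rotatable" feature constrained by the relative camera rotation is to treat the feature vector as a bundle of 3D vectors that the rotation matrix can act on, as in the sketch below; the shapes and the choice of an MSE penalty are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rotation_consistency_loss(f1, f2, rel_rot):
    """Sketch of a rotatable-feature constraint: features (B, C) with C divisible by 3 are
    reshaped into (B, K, 3) bundles of 3D vectors; rotating view 1's bundle by the relative
    camera rotation (B, 3, 3) should reproduce view 2's bundle."""
    b, c = f1.shape
    assert c % 3 == 0
    v1 = f1.view(b, c // 3, 3)
    v2 = f2.view(b, c // 3, 3)
    rotated = torch.einsum('bij,bkj->bki', rel_rot, v1)   # apply R to each 3D feature vector
    return F.mse_loss(rotated, v2)
```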
https://arxiv.org/abs/2305.12704
Learning-based gaze estimation methods require large amounts of training data with accurate gaze annotations. To ease the demanding requirements of gaze data collection and annotation, several image synthesis methods have been proposed that can precisely redirect gaze directions given assigned conditions. However, these methods focused on changing gaze directions in images that include only the eyes, or restricted face regions at low resolution (less than $128\times128$), to largely reduce interference from other attributes such as hair, which limits application scenarios. To cope with this limitation, we propose a portable network, called ReDirTrans, achieving latent-to-latent translation for redirecting gaze directions and head orientations in an interpretable manner. ReDirTrans projects input latent vectors into aimed-attribute embeddings only and redirects these embeddings with assigned pitch and yaw values. Then both the initial and edited embeddings are projected back (deprojected) to the initial latent space as residuals to modify the input latent vectors by subtraction and addition, representing old status removal and new status addition. Projecting only the aimed attributes and using subtraction-addition operations for status replacement essentially mitigates impacts on other attributes and on the distribution of latent vectors. Thus, by combining ReDirTrans with a pretrained, fixed e4e-StyleGAN pair, we created ReDirTrans-GAN, which enables accurately redirecting gaze in full-face images at $1024\times1024$ resolution while preserving other attributes such as identity, expression, and hairstyle. Furthermore, we present improvements for the downstream learning-based gaze estimation task, using redirected samples as dataset augmentation.
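The subtract/add editing scheme can be sketched in a few lines: project the latent to an aimed-attribute embedding, redirect it with the target pitch and yaw, deproject both embeddings back to latent space, and edit the latent by removing the old status and adding the new one. Module sizes and layer choices below are illustrative assumptions, not ReDirTrans itself.

```python
import torch
import torch.nn as nn

class LatentRedirector(nn.Module):
    """Toy version of the subtract/add editing scheme described above."""
    def __init__(self, latent_dim=512, embed_dim=64):
        super().__init__()
        self.project = nn.Linear(latent_dim, embed_dim)
        self.redirect = nn.Linear(embed_dim + 2, embed_dim)   # conditioned on target pitch/yaw
        self.deproject = nn.Linear(embed_dim, latent_dim)

    def forward(self, w, target_pitchyaw):
        e_old = self.project(w)
        e_new = self.redirect(torch.cat([e_old, target_pitchyaw], dim=1))
        # Old-status removal and new-status addition, leaving the rest of w untouched.
        return w - self.deproject(e_old) + self.deproject(e_new)
```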
https://arxiv.org/abs/2305.11452
Despite the recent development of learning-based gaze estimation methods, most methods require one or more eye or face region crops as inputs and produce a gaze direction vector as output. Cropping yields higher resolution in the eye regions and fewer confounding factors (such as clothing and hair), which is believed to benefit the final model performance. However, this eye/face patch cropping process is expensive, error-prone, and implementation-specific across methods. In this paper, we propose a frame-to-gaze network that directly predicts both the 3D gaze origin and the 3D gaze direction from the raw camera frame without any face or eye cropping. Our method demonstrates that direct gaze regression from the raw downscaled frame, from FHD/HD to VGA/HVGA resolution, is possible despite the challenge of having very few pixels in the eye region. The proposed method achieves comparable results to state-of-the-art methods in Point-of-Gaze (PoG) estimation on three public gaze datasets: GazeCapture, MPIIFaceGaze, and EVE, and generalizes well to extreme camera view changes.
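A minimal frame-to-gaze model is simply one backbone on the raw, downscaled frame with two regression heads, one for the 3D gaze origin and one for the 3D gaze direction, as sketched below; the backbone choice and head sizes are assumptions, not the paper's architecture.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class FrameToGaze(nn.Module):
    """Minimal frame-to-gaze sketch: one backbone on the raw downscaled frame,
    two heads for 3D gaze origin and direction, no face/eye cropping."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                  # expose the 512-d pooled feature
        self.backbone = backbone
        self.origin_head = nn.Linear(512, 3)         # 3D gaze origin (camera coordinates)
        self.direction_head = nn.Linear(512, 3)      # 3D gaze direction

    def forward(self, frame):                        # frame: (B, 3, H, W), e.g. VGA-sized
        feat = self.backbone(frame)
        return self.origin_head(feat), F.normalize(self.direction_head(feat), dim=1)
```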
https://arxiv.org/abs/2305.05526
Gaze estimation is a crucial task in computer vision; however, existing methods suffer from high computational costs, which limit their practical deployment in resource-limited environments. In this paper, we propose a novel lightweight model, FR-Net, for accurate gaze angle estimation with significantly reduced computational complexity. FR-Net utilizes the Fast Fourier Transform (FFT) to extract gaze-relevant features in the frequency domain while reducing the number of parameters. Additionally, we introduce a shortcut component that focuses on the spatial domain to further improve the accuracy of our model. Our experimental results demonstrate that our approach achieves substantially lower gaze error angles (3.86 on MPII and 4.51 on EYEDIAP) compared to state-of-the-art gaze estimation methods, while utilizing 17 times fewer parameters (0.67M) and only 12\% of the FLOPs (0.22B). Furthermore, our method outperforms existing lightweight methods in terms of accuracy and efficiency for the gaze estimation task. These results suggest that the proposed approach has significant potential for applications in areas such as human-computer interaction and driver assistance systems.
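The sketch below shows one way an FFT-based feature block with a spatial-domain shortcut can look: a learned per-channel filter applied in the frequency domain plus a cheap spatial convolution path. The filter parameterization is an assumption in the spirit of FR-Net, not its actual design.

```python
import torch
import torch.nn as nn

class FrequencyBlock(nn.Module):
    """FFT feature block with a spatial shortcut: a learned per-channel gain applied in
    the frequency domain plus a lightweight spatial path."""
    def __init__(self, channels):
        super().__init__()
        self.freq_weight = nn.Parameter(torch.ones(channels, 1, 1))       # per-channel gain
        self.shortcut = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                               # x: (B, C, H, W), real-valued
        spec = torch.fft.rfft2(x, norm='ortho')         # complex spectrum, (B, C, H, W//2+1)
        filtered = torch.fft.irfft2(spec * self.freq_weight, s=x.shape[-2:], norm='ortho')
        return filtered + self.shortcut(x)              # frequency path + spatial shortcut
```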
https://arxiv.org/abs/2305.11875
We present a deep learning method for accurately localizing the center of a single corneal reflection (CR) in an eye image. Unlike previous approaches, we use a convolutional neural network (CNN) that was trained solely on simulated data. Using only simulated data has the benefit of completely sidestepping the time-consuming manual annotation that is required for supervised training on real eye images. To systematically evaluate the accuracy of our method, we first tested it on images with simulated CRs placed on different backgrounds and embedded in varying levels of noise. Second, we tested the method on high-quality videos captured from real eyes. Our method outperformed state-of-the-art algorithmic methods on real eye images, reducing the spatial precision error by 35%, and performed on par with the state of the art on simulated images in terms of spatial accuracy. We conclude that our method provides precise CR center localization and offers a solution to the data availability problem, which is one of the important common roadblocks in the development of deep learning models for gaze estimation. Due to its superior CR center localization and ease of application, our method has the potential to improve the accuracy and precision of CR-based eye trackers.
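One common way to read a sub-pixel CR center out of a CNN is a soft-argmax over a predicted heatmap, sketched below; whether the paper's network uses a heatmap head is not stated in the abstract, so treat this as a generic localization readout rather than the paper's method.

```python
import torch

def soft_argmax_2d(heatmap):
    """Sub-pixel center from a (B, H, W) heatmap via a softmax-weighted average of pixel
    coordinates -- a generic readout for a CNN-predicted CR center."""
    b, h, w = heatmap.shape
    probs = torch.softmax(heatmap.view(b, -1), dim=1).view(b, h, w)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    cy = (probs.sum(dim=2) * ys).sum(dim=1)             # expected row coordinate
    cx = (probs.sum(dim=1) * xs).sum(dim=1)             # expected column coordinate
    return torch.stack([cx, cy], dim=1)                 # (B, 2) centers in pixels
```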
https://arxiv.org/abs/2304.05673
Gaze tracking is a valuable tool with a broad range of applications in various fields, including medicine, psychology, virtual reality, marketing, and safety. Therefore, it is essential to have gaze tracking software that is cost-efficient and high-performing. Accurately predicting gaze remains a difficult task, particularly in real-world situations where images are affected by motion blur, video compression, and noise. Super-resolution (SR) has been shown to improve image quality from a visual perspective. This work examines the usefulness of super-resolution for improving appearance-based gaze tracking. We show that not all SR models preserve the gaze direction. We propose a two-step framework based on the SwinIR super-resolution model, which consistently outperforms the state of the art, particularly in scenarios involving low-resolution or degraded images. Furthermore, we examine the use of super-resolution through the lens of self-supervised learning for gaze prediction. Self-supervised learning aims to learn from unlabelled data to reduce the amount of labeled data required for downstream tasks. We propose a novel architecture called SuperVision that fuses an SR backbone network with a ResNet18 (with skip connections). The proposed SuperVision method uses 5x less labeled data and yet outperforms, by 15%, the state-of-the-art GazeTR method, which uses 100% of the training data.
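The two-step idea itself is straightforward to express: super-resolve the degraded input first, then regress gaze from the enhanced image. In the sketch below, `sr_model` stands in for a pretrained SwinIR-style network and `gaze_model` for any gaze regressor; both interfaces are assumptions, not the paper's code.

```python
import torch.nn as nn

class TwoStepGaze(nn.Module):
    """Two-step sketch: super-resolve the degraded input, then regress gaze from the
    enhanced image."""
    def __init__(self, sr_model, gaze_model):
        super().__init__()
        self.sr_model = sr_model        # assumed pretrained SR network (e.g. SwinIR-style)
        self.gaze_model = gaze_model    # any gaze regressor taking an image

    def forward(self, low_res_face):
        enhanced = self.sr_model(low_res_face)   # e.g. 4x upscaled face image
        return self.gaze_model(enhanced)
```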
https://arxiv.org/abs/2303.10151
Deep learning appearance-based 3D gaze estimation is gaining popularity due to its minimal hardware requirements and freedom from constraints. Unreliable and overconfident inferences, however, still limit the adoption of this gaze estimation method. To address these issues, we introduce a confidence-aware model that predicts uncertainties together with gaze angle estimates. We also introduce a novel effectiveness evaluation method, based on the causality between eye feature degradation and the rise in inference uncertainty, to assess the uncertainty estimation. Our confidence-aware model demonstrates reliable uncertainty estimation while providing angular estimation accuracy on par with the state of the art. Compared with the existing statistical uncertainty-angular-error evaluation metric, the proposed effectiveness evaluation approach can more effectively judge the performance of the inferred uncertainties at each prediction.
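A standard way to have a network predict uncertainty together with its gaze estimate is heteroscedastic regression with a Gaussian negative log-likelihood, sketched below as a generic formulation rather than the paper's exact confidence-aware model.

```python
import torch

def gaussian_nll(pred_gaze, pred_log_var, true_gaze):
    """Heteroscedastic regression loss: the model outputs a gaze estimate plus a per-sample
    log-variance, so confident-but-wrong predictions are penalized more heavily."""
    inv_var = torch.exp(-pred_log_var)
    return (inv_var * (pred_gaze - true_gaze) ** 2 + pred_log_var).mean()
```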
https://arxiv.org/abs/2303.10062
Appearance-based gaze estimation systems have shown great progress recently, yet the performance of these techniques depends on the datasets used for training. Most of the existing gaze estimation datasets set up in interactive settings were recorded under laboratory conditions, and those recorded in the wild display limited head pose and illumination variation. Further, precision evaluation of existing gaze estimation approaches has received little attention so far. In this work, we present a large gaze estimation dataset, PARKS-Gaze, with wider head pose and illumination variation and with multiple samples for a single Point of Gaze (PoG). The dataset contains 974 minutes of data from 28 participants with a head pose range of 60 degrees in both the yaw and pitch directions. Our within-dataset, cross-dataset, and precision evaluations indicate that the proposed dataset is more challenging and enables models to generalize to unseen participants better than the existing in-the-wild datasets. The project page can be accessed here: this https URL
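Since the dataset provides multiple samples per Point of Gaze, precision can be evaluated; one standard definition is the RMS of sample-to-sample differences, sketched below. The dataset's exact precision protocol may differ, so this is an illustrative assumption.

```python
import numpy as np

def rms_s2s_precision(gaze_deg):
    """Precision of repeated estimates of one Point of Gaze as the RMS of sample-to-sample
    differences, in degrees. `gaze_deg` is an (N, 2) array of yaw/pitch estimates."""
    diffs = np.diff(gaze_deg, axis=0)                    # successive-sample differences
    return float(np.sqrt(np.mean(np.sum(diffs ** 2, axis=1))))
```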
https://arxiv.org/abs/2302.02353
Rather than regressing gaze direction directly from images, we show that adding a 3D shape model can: i) improve gaze estimation accuracy, ii) perform well with lower resolution inputs and iii) provide a richer understanding of the eye-region and its constituent gaze system. Specifically, we use an `eyes and nose' 3D morphable model (3DMM) to capture the eye-region 3D facial geometry and appearance and we equip this with a geometric vergence model of gaze to give an `active-gaze 3DMM'. We show that our approach achieves state-of-the-art results on the Eyediap dataset and we present an ablation study. Our method can learn with only the ground truth gaze target point and the camera parameters, without access to the ground truth gaze origin points, thus widening the applicability of our approach compared to other methods.
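The geometric vergence idea can be stated in two lines: given the 3D eyeball centers recovered from the model fit and a 3D gaze target point, each eye's gaze direction is the unit vector from its center to the target, as sketched below; the conventions here are assumptions rather than the paper's exact active-gaze 3DMM formulation.

```python
import numpy as np

def vergence_gaze(left_eye_center, right_eye_center, target_point):
    """Geometric vergence: with 3D eyeball centers and a 3D gaze target, each eye's gaze
    direction is the unit vector from its center to the target."""
    def unit(v):
        return v / np.linalg.norm(v)
    return unit(target_point - left_eye_center), unit(target_point - right_eye_center)
```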
https://arxiv.org/abs/2301.13186