Gaze estimation methods often experience significant performance degradation when evaluated across different domains, due to the domain gap between training and testing data. Existing methods try to address this issue with various domain generalization approaches, but with little success because of the limited diversity of gaze datasets in factors such as appearance, wearable devices, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to leverage a vision-and-language cross-modality approach for the gaze estimation task. Specifically, we extract gaze-relevant features by pushing them away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context optimization method for text prompt tuning. Furthermore, we utilize the relationships among gaze samples to refine the distribution of gaze-relevant features, thereby improving the generalization capability of the gaze estimation model. Extensive experiments demonstrate the excellent performance of CLIP-Gaze over existing methods on four cross-domain evaluations.
https://arxiv.org/abs/2403.05124
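A minimal sketch of the push-away idea described above, assuming CLIP-style text embeddings of gaze-irrelevant factors and a simple cosine-similarity penalty; the names and loss form are illustrative, not the paper's exact formulation.

```python
# Hedged sketch: push image gaze features away from CLIP text embeddings
# of gaze-irrelevant factors (identity, illumination, image quality, ...).
import torch
import torch.nn.functional as F

def push_away_loss(gaze_feat: torch.Tensor, irrelevant_text_feats: torch.Tensor) -> torch.Tensor:
    """gaze_feat: (B, D) image features; irrelevant_text_feats: (K, D) text embeddings."""
    g = F.normalize(gaze_feat, dim=-1)
    t = F.normalize(irrelevant_text_feats, dim=-1)
    sim = g @ t.t()                    # (B, K) cosine similarities
    return sim.clamp(min=0).mean()     # penalize alignment with irrelevant factors
```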
The latest gaze estimation methods require large-scale training data, but their collection and exchange pose significant privacy risks. We propose PrivatEyes, the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users, with server-based secure aggregation of the individual estimators' updates. PrivatEyes guarantees that individual gaze data remain private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack, DualView, which shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to lower gaze estimation accuracy or substantially higher computational costs; both are on par with its non-secure counterparts.
https://arxiv.org/abs/2402.18970
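The MPC building block behind secure aggregation can be illustrated with additive secret sharing: each client splits its model update into random shares, one per server, so no single server (or minority coalition) learns the update, yet the shares sum to the true aggregate. This is a toy sketch in real arithmetic, not PrivatEyes' actual protocol, which works over finite fields and adds malicious-security machinery.

```python
# Toy additive secret sharing over the reals, for illustration only.
import numpy as np

def make_shares(update: np.ndarray, n_servers: int) -> list:
    shares = [np.random.randn(*update.shape) for _ in range(n_servers - 1)]
    shares.append(update - sum(shares))        # shares sum exactly to `update`
    return shares

client_updates = [np.ones(4), 2 * np.ones(4)]  # two clients' model updates
# Server i sums the i-th share from every client; adding the per-server
# sums reconstructs only the aggregate, never an individual update.
per_server = [sum(s) for s in zip(*(make_shares(u, 3) for u in client_updates))]
assert np.allclose(sum(per_server), sum(client_updates))
```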
Gaze object prediction aims to predict the location and category of the object that a human is looking at. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors can predict more accurate object locations for dense objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer helps to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces the Transformer into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the location of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism to let the queries of the gaze autoencoder learn global-memory position knowledge from the object detector. Finally, to make the whole framework trainable end-to-end, we propose a Gaze Box loss that jointly optimizes the object detector and the gaze regressor by enhancing the gaze heatmap energy inside the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that our TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at this https URL.
https://arxiv.org/abs/2402.13578
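One plausible reading of the Gaze Box loss is that it rewards heatmap energy falling inside the detected gaze object's box; the sketch below is our illustration of that idea with hypothetical names, not the paper's exact definition.

```python
import torch

def gaze_box_loss(heatmap: torch.Tensor, box) -> torch.Tensor:
    """heatmap: (H, W) non-negative gaze heatmap; box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = (int(v) for v in box)
    inside = heatmap[y1:y2, x1:x2].sum()      # energy inside the gaze object box
    total = heatmap.sum() + 1e-8
    return 1.0 - inside / total               # minimal when all energy is in the box
```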
Gaze estimation, the task of predicting where an individual is looking, is critical, with direct applications in areas such as human-computer interaction and virtual reality. Estimating gaze direction in unconstrained environments is difficult, due to the many factors that can obscure the face and eye regions. In this work we propose CrossGaze, a strong baseline for gaze estimation that leverages recent developments in computer vision architectures and attention-based modules. Unlike previous approaches, our method does not require a specialized architecture; it utilizes already established models that we integrate into our architecture and adapt for the task of 3D gaze estimation. This approach allows seamless updates to the architecture, as any module can be replaced with more powerful feature extractors. On the Gaze360 benchmark, our model surpasses several state-of-the-art methods, achieving a mean angular error of 9.94 degrees. Our proposed model serves as a strong foundation for future research and development in gaze estimation, paving the way for practical and accurate gaze prediction in real-world scenarios.
https://arxiv.org/abs/2402.08316
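The 9.94-degree figure refers to the standard mean angular error between predicted and ground-truth 3D gaze vectors, which can be computed as follows:

```python
import numpy as np

def mean_angular_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (N, 3) 3D gaze direction vectors; returns degrees."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip((pred * gt).sum(axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```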
Advances in face swapping have enabled the automatic generation of highly realistic faces. Yet face swaps are perceived differently than real faces, with key differences in viewer behavior surrounding the eyes. Face swapping algorithms generally place no emphasis on the eyes, relying on pixel or feature matching losses that consider the entire face to guide the training process. We further investigate viewer perception of face swaps, focusing our analysis on the presence of an uncanny valley effect. We additionally propose a novel loss term for the training of face swapping models, leveraging a pretrained gaze estimation network to directly improve the representation of the eyes. We confirm that viewed face swaps do elicit uncanny responses from viewers. Our proposed improvements significantly reduce viewing angle errors between face swaps and their source material. Our method additionally reduces the prevalence of the eyes as a deciding factor when viewers perform deepfake detection tasks. Our findings have implications for face swapping for special effects, digital avatars, privacy mechanisms, and more; negative responses from users could limit effectiveness in said applications. Our gaze improvements are a first step towards alleviating negative viewer perceptions via a targeted approach.
https://arxiv.org/abs/2402.03188
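A minimal sketch of a gaze-based training loss of the kind described: a frozen, pretrained gaze estimator scores both the swapped and the source face, and the angular disagreement is penalized. The `gaze_net` name and the cosine-distance form are assumptions, not the paper's exact equation.

```python
import torch
import torch.nn.functional as F

def gaze_consistency_loss(gaze_net, swapped: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
    """gaze_net: frozen estimator returning (B, 3) gaze vectors."""
    with torch.no_grad():
        g_src = F.normalize(gaze_net(source), dim=-1)   # target gaze from source face
    g_swp = F.normalize(gaze_net(swapped), dim=-1)
    return (1.0 - (g_swp * g_src).sum(dim=-1)).mean()   # cosine distance
```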
Drivers' gaze information can be crucial in driving research because of its relation to driver attention. In particular, including gaze data in driving simulators broadens the scope of research studies, as drivers' gaze patterns can be related to their characteristics and performance. In this paper, we present two gaze region estimation modules integrated into a driving simulator. One uses the 3D Kinect device and the other uses the virtual reality Oculus Rift device. The modules detect which of the seven regions into which the driving scene was divided a driver is gazing at in every processed frame of the route. Four gaze estimation methods, which learn the relation between gaze displacement and head movement, were implemented and compared. Two are simpler and based on points that try to capture this relation, and two are based on classifiers such as an MLP and an SVM. Experiments were carried out with 12 users who drove the same scenario twice, each time with a different visualization display, first with a big screen and later with the Oculus Rift. On the whole, the Oculus Rift outperformed the Kinect as the best hardware for gaze estimation. The Oculus-based gaze region estimation method with the highest performance achieved an accuracy of 97.94%. The information provided by the Oculus Rift module enriches the driving simulator data and makes a multimodal driving performance analysis possible, apart from the immersion and realism obtained with the virtual reality experience provided by the Oculus.
https://arxiv.org/abs/2402.05248
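The classifier-based variants can be sketched as a standard supervised mapping from head-movement features to one of the seven scene regions; the feature layout below (head yaw/pitch/roll per frame) is an assumption standing in for the paper's actual inputs.

```python
# Minimal sketch with synthetic data standing in for recorded frames.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X = np.random.randn(200, 3)        # per-frame head-movement features (assumed)
y = np.random.randint(0, 7, 200)   # gaze region label, one of 7 scene regions

for clf in (SVC(kernel="rbf"), MLPClassifier(max_iter=500)):
    clf.fit(X, y)
    print(type(clf).__name__, "train accuracy:", clf.score(X, y))
```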
In this research, we present SLYKLatent, a novel approach for enhancing gaze estimation by addressing appearance instability challenges in datasets due to aleatoric uncertainties, covariate shifts, and test domain generalization. SLYKLatent utilizes self-supervised learning for initial training with facial expression datasets, followed by refinement with a patch-based tri-branch network and an inverse explained-variance-weighted training loss function. Our evaluation on benchmark datasets achieves an 8.7% improvement on Gaze360, rivals top MPIIFaceGaze results, and leads on a subset of ETH-XGaze by 13%, surpassing existing methods by significant margins. Adaptability tests on RAF-DB and AffectNet show accuracies of 86.4% and 60.9%, respectively. Ablation studies confirm the effectiveness of SLYKLatent's novel components. This approach has strong potential in human-robot interaction.
https://arxiv.org/abs/2402.01555
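One hedged interpretation of an inverse explained-variance-weighted loss: per-dimension error terms are reweighted by the inverse of each target dimension's explained variance. This is purely our illustration; the paper defines the actual weighting.

```python
import torch

def iev_weighted_mse(pred: torch.Tensor, target: torch.Tensor,
                     explained_var: torch.Tensor) -> torch.Tensor:
    """pred, target: (B, D); explained_var: (D,) values in (0, 1]."""
    w = 1.0 / explained_var                   # inverse explained-variance weights
    return ((pred - target) ** 2 * w).mean()
```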
Recently, appearance-based gaze estimation has been attracting attention in computer vision, and remarkable improvements have been achieved using various deep learning techniques. Despite such progress, most methods aim to infer gaze vectors from images directly, which causes overfitting to person-specific appearance factors. In this paper, we address these challenges and propose a novel framework: Stochastic subject-wise Adversarial gaZE learning (SAZE), which trains a network to generalize across subjects' appearance. We design a Face generalization Network (Fgen-Net) using a face-to-gaze encoder, a face identity classifier, and a proposed adversarial loss. The proposed loss generalizes face appearance factors so that the identity classifier infers a uniform probability distribution. In addition, Fgen-Net is trained by a learning mechanism that optimizes the network by reselecting a subset of subjects at every training step to avoid overfitting. Our experimental results verify the robustness of the method in that it yields state-of-the-art performance, achieving 3.89 and 4.42 degrees on the MPIIGaze and EyeDiap datasets, respectively. Furthermore, we demonstrate a positive generalization effect with further experiments on face images of different styles generated by a generative model.
https://arxiv.org/abs/2401.13865
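The adversarial objective, pushing the identity classifier toward a uniform output so that gaze features carry no subject identity, can be sketched as a KL divergence to the uniform distribution (our formulation; the paper's loss may differ in detail):

```python
import torch
import torch.nn.functional as F

def uniformity_loss(id_logits: torch.Tensor) -> torch.Tensor:
    """id_logits: (B, S) identity-classifier logits over S subjects."""
    log_p = F.log_softmax(id_logits, dim=-1)
    uniform = torch.full_like(log_p, 1.0 / log_p.size(-1))
    # KL(uniform || p): zero exactly when the classifier outputs uniform.
    return F.kl_div(log_p, uniform, reduction="batchmean")
```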
Despite recent remarkable achievements in gaze estimation, efficient and accurate personalization of gaze estimation without labels is a practical problem that is rarely touched on in the literature. To achieve efficient personalization, we take inspiration from recent advances in Natural Language Processing (NLP) by updating a negligible number of parameters, "prompts", at test time. Specifically, the prompt is attached without perturbing the original network and can contain fewer than 1% of a ResNet-18's parameters. Our experiments show the high efficiency of the prompt tuning approach: it can be 10 times faster in terms of adaptation speed than the methods compared. However, it is non-trivial to update the prompt for personalized gaze estimation without labels. At test time, it is essential to ensure that minimizing a particular unsupervised loss leads to minimizing the gaze estimation error. To address this difficulty, we propose to meta-learn the prompt to ensure that its updates align with this goal. Our experiments show that the meta-learned prompt can be effectively adapted even with a simple symmetry loss. In addition, we experiment on four cross-dataset validations to show the remarkable advantages of the proposed method.
https://arxiv.org/abs/2401.01577
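A simple symmetry loss of the kind mentioned can be written down directly: the gaze predicted on a horizontally flipped face should equal the original prediction with the yaw sign negated. The (pitch, yaw) output convention is an assumption.

```python
import torch

def symmetry_loss(model, images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, H, W); model returns (B, 2) gaze as (pitch, yaw)."""
    g = model(images)
    g_flip = model(torch.flip(images, dims=[-1]))        # mirror horizontally
    mirrored = torch.stack([g_flip[:, 0], -g_flip[:, 1]], dim=1)
    return ((g - mirrored) ** 2).mean()                  # no labels needed
```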
Introduction: In the realm of human-computer interaction and behavioral research, accurate real-time gaze estimation is critical. Traditional methods often rely on expensive equipment or large datasets, which are impractical in many scenarios. This paper introduces a novel, geometry-based approach to address these challenges, utilizing consumer-grade hardware for broader applicability. Methods: We leverage novel face landmark detection neural networks capable of fast inference on consumer-grade chips to generate accurate and stable 3D landmarks of the face and iris. From these, we derive a small set of geometry-based descriptors, forming an 8-dimensional manifold representing the eye and head movements. These descriptors are then used to formulate linear equations for predicting eye-gaze direction. Results: Our approach demonstrates the ability to predict gaze with an angular error of less than 1.9 degrees, rivaling state-of-the-art systems while operating in real time and requiring negligible computational resources. Conclusion: The developed method marks a significant step forward in gaze estimation technology, offering a highly accurate, efficient, and accessible alternative to traditional systems. It opens up new possibilities for real-time applications in diverse fields, from gaming to psychological research.
https://arxiv.org/abs/2401.00406
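The linear-prediction step can be sketched as an ordinary least-squares fit from the 8-dimensional descriptors to gaze angles; the synthetic data below stands in for the real descriptors derived from the 3D face and iris landmarks.

```python
import numpy as np

D = np.random.randn(500, 8)               # geometry-based descriptors (assumed)
G = np.random.randn(500, 2)               # gaze (pitch, yaw) targets
A = np.hstack([D, np.ones((500, 1))])     # add a bias column
W, *_ = np.linalg.lstsq(A, G, rcond=None) # (9, 2) linear weights

pred = A @ W                              # predicted gaze angles
```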
Over the past decade, visual gaze estimation has garnered growing attention within the research community, thanks to its wide-ranging application scenarios. While existing estimation approaches have achieved remarkable success in enhancing prediction accuracy, they primarily infer gaze directions from single-image signals and discard the huge potential of the currently dominant text guidance. Notably, visual-language collaboration has been extensively explored across a range of visual tasks, such as image synthesis and manipulation, leveraging the remarkable transferability of the large-scale Contrastive Language-Image Pre-training (CLIP) model. Nevertheless, existing gaze estimation approaches ignore the rich semantic cues conveyed by linguistic signals and the priors in CLIP's feature space, thereby yielding performance setbacks. To bridge this gap, we delve deeply into the text-eye collaboration protocol and introduce a novel gaze estimation framework, referred to as GazeCLIP. Specifically, we design a linguistic description generator to produce text signals with coarse directional cues. Additionally, a CLIP-based backbone that excels at characterizing text-eye pairs for gaze estimation is presented. This is followed by a fine-grained multi-modal fusion module aimed at modeling the interrelationships between heterogeneous inputs. Extensive experiments on three challenging datasets demonstrate the superiority of the proposed GazeCLIP, which surpasses previous approaches and achieves state-of-the-art estimation accuracy.
https://arxiv.org/abs/2401.00260
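The coarse directional text signal can be illustrated with the open-source CLIP package; the prompt wording below is our guess, not the output of the paper's description generator.

```python
import clip
import torch

model, _ = clip.load("ViT-B/32")
prompts = [f"a photo of a face gazing {d}"
           for d in ("up", "down", "left", "right", "straight ahead")]
with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(prompts))  # (5, 512) embeddings
```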
Gaze estimation has become a subject of growing interest in recent research. Most current methods rely on single-view facial images as input. Yet it is hard for these approaches to handle large head angles, leading to potential inaccuracies in the estimation. To address this issue, adding a second-view camera can help better capture eye appearance. However, existing multi-view methods have two limitations. 1) They require multi-view annotations for training, which are expensive. 2) More importantly, during testing, the exact positions of the multiple cameras must be known and must match those used in training, which limits the application scenarios. To address these challenges, we propose a novel 1-view-to-2-views (1-to-2 views) adaptation solution, the Unsupervised 1-to-2 Views Adaptation framework for Gaze estimation (UVAGaze). Our method adapts a traditional single-view gaze estimator to flexibly placed dual cameras. Here, "flexibly" means we place the dual cameras in arbitrary places regardless of the training data, without knowing their extrinsic parameters. Specifically, UVAGaze builds a dual-view mutual supervision adaptation strategy, which takes advantage of the intrinsic consistency of gaze directions between both views. In this way, our method can not only benefit from common single-view pre-training, but also achieve more advanced dual-view gaze estimation. The experimental results show that a single-view estimator, when adapted for dual views, can achieve much higher accuracy, especially in cross-dataset settings, with a substantial improvement of 47.0%. Project page: this https URL.
https://arxiv.org/abs/2312.15644
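The mutual supervision idea rests on a simple geometric fact: a gaze direction predicted in one view, rotated by the relative camera rotation, should agree with the prediction in the other view. A hedged sketch, with the rotation `R` assumed to be estimated elsewhere during adaptation:

```python
import torch
import torch.nn.functional as F

def dual_view_consistency(model, img1, img2, R: torch.Tensor) -> torch.Tensor:
    """R: (3, 3) rotation from view-1 to view-2 camera coordinates."""
    g1 = F.normalize(model(img1), dim=-1)     # (B, 3) gaze in view-1 frame
    g2 = F.normalize(model(img2), dim=-1)     # (B, 3) gaze in view-2 frame
    g1_in_2 = g1 @ R.t()                      # re-express view-1 gaze in view-2 frame
    return (1.0 - (g1_in_2 * g2).sum(dim=-1)).mean()
```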
Human eye gaze estimation is an important cognitive ingredient for successful human-robot interaction, enabling the robot to read and predict human behavior. We approach this problem using artificial neural networks and build a modular system estimating gaze from separately cropped eyes, taking advantage of existing well-functioning components for face detection (RetinaFace) and head pose estimation (6DRepNet). Our proposed method does not require any special hardware or infrared filters but uses a standard notebook-builtin RGB camera, as is common for appearance-based methods. Using the MetaHuman tool, we also generated a large synthetic dataset of more than 57,000 human faces and made it publicly available. Including this dataset (with eye gaze and head pose information) on top of the standard Columbia Gaze dataset in training the model led to better accuracy, with a mean average error below two degrees in the eye pitch and yaw directions, which compares favourably to related methods. We also verified the feasibility of our model by preliminary testing in a real-world setting using the built-in 4K camera in the eye of the NICO semi-humanoid robot.
https://arxiv.org/abs/2311.14175
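The modular pipeline reads as follows in outline; `detect_face`, `estimate_head_pose`, and `crop_eyes` are placeholders standing in for the RetinaFace, 6DRepNet, and landmark-based cropping stages, not those libraries' real APIs.

```python
def estimate_gaze(frame, detect_face, estimate_head_pose, crop_eyes, eye_net):
    box, landmarks = detect_face(frame)               # RetinaFace stage
    head_pose = estimate_head_pose(frame, box)        # 6DRepNet stage (yaw, pitch, roll)
    left_eye, right_eye = crop_eyes(frame, landmarks) # separately cropped eyes
    return eye_net(left_eye, right_eye, head_pose)    # gaze from eyes + head pose
```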
Gaze estimation is a valuable technology with numerous applications in fields such as human-computer interaction, virtual reality, and medicine. This report presents the implementation of a gaze estimation system using the Sony Spresense microcontroller board and explores its performance in terms of latency, MAC/cycle, and power consumption. The report also provides insights into the system's architecture, including the gaze estimation model used. Additionally, a demonstration of the system is presented, showcasing its functionality and performance. Our lightweight model, TinyTrackerS, is a mere 169 kB in size, uses 85.8k parameters, and runs on the Spresense platform at 3 FPS.
https://arxiv.org/abs/2308.12313
The advent of foundation models signals a new era in artificial intelligence. The Segment Anything Model (SAM) is the first foundation model for image segmentation. In this study, we evaluate SAM's ability to segment features from eye images recorded in virtual reality setups. The increasing requirement for annotated eye-image datasets presents a significant opportunity for SAM to redefine the landscape of data annotation in gaze estimation. Our investigation centers on SAM's zero-shot learning abilities and the effectiveness of prompts such as bounding boxes or point clicks. Our results are consistent with studies in other domains, demonstrating that SAM's segmentation effectiveness can be on par with specialized models depending on the feature, with prompts improving its performance, evidenced by an IoU of 93.34% for pupil segmentation in one dataset. Foundation models like SAM could revolutionize gaze estimation by enabling quick and easy image segmentation, reducing reliance on specialized models and extensive manual annotation.
https://arxiv.org/abs/2311.08077
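Point-click prompting with SAM follows the segment-anything package's predictor interface; the checkpoint path, image, and click coordinates below are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

eye_image = np.zeros((240, 320, 3), dtype=np.uint8)  # placeholder; load a real eye image
predictor.set_image(eye_image)                       # HxWx3 uint8 RGB
masks, scores, _ = predictor.predict(
    point_coords=np.array([[160, 120]]),             # a click on the pupil
    point_labels=np.array([1]),                      # 1 = foreground point
)
```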
DeepFake detection is pivotal to personal privacy and public safety. With the iterative advancement of DeepFake techniques, high-quality forged videos and images are becoming increasingly deceptive. Prior research has seen numerous attempts by scholars to incorporate biometric features into the field of DeepFake detection. However, traditional biometric-based approaches tend to segregate biometric features from general ones and freeze the biometric feature extractor. These approaches result in the exclusion of valuable general features, potentially leading to a performance decline and, consequently, a failure to fully exploit the potential of biometric information in assisting DeepFake detection. Moreover, insufficient attention has been dedicated to scrutinizing gaze authenticity within the realm of DeepFake detection in recent years. In this paper, we introduce GazeForensics, an innovative DeepFake detection method that utilizes the gaze representation obtained from a 3D gaze estimation model to regularize the corresponding representation within our DeepFake detection model, while concurrently integrating general features to further enhance the performance of our model. Experiment results reveal that our proposed GazeForensics outperforms the current state-of-the-art methods.
https://arxiv.org/abs/2311.07075
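A hedged sketch of the regularization idea: part of the detector's representation is aligned with a frozen 3D gaze model's embedding while the remainder stays free for general forgery cues. The feature split and weighting are our assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def gazeforensics_loss(feat, gaze_emb, logits, labels, alpha=0.5):
    """feat: (B, D) detector features; gaze_emb: (B, Dg) frozen gaze features, Dg <= D."""
    gaze_part = feat[:, : gaze_emb.size(1)]   # slice regularized by gaze representation
    reg = F.mse_loss(gaze_part, gaze_emb)
    cls = F.cross_entropy(logits, labels)     # real/fake classification term
    return cls + alpha * reg
```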
Although the number of gaze estimation datasets is growing, the application of appearance-based gaze estimation methods is mostly limited to estimating the point of gaze on a screen. This is in part because most datasets are generated in a similar fashion, where the gaze target is on a screen close to the camera's origin. In other applications, such as assistive robotics or marketing research, the 3D point of gaze might not be close to the camera's origin, meaning models trained on current datasets do not generalize well to these tasks. We therefore suggest generating a textured three-dimensional mesh of the face and rendering the training images from a virtual camera at a position and orientation specific to the application, as a means of augmenting the existing datasets. In our tests, this led to an average 47% decrease in gaze estimation angular error.
https://arxiv.org/abs/2310.18469
In this letter, we propose a new method, Multi-Clue Gaze (MCGaze), to facilitate video gaze estimation by capturing the spatial-temporal interaction context among head, face, and eye in an end-to-end learning way, which has not been well explored yet. The main advantage of MCGaze is that the tasks of clue localization for head, face, and eye can be solved jointly with gaze estimation in a one-step manner, with joint optimization seeking optimal performance. During this process, spatial-temporal context exchange happens among the clues of the head, face, and eye. Accordingly, the final gazes obtained by fusing features from the various queries are simultaneously aware of global clues from heads and faces and local clues from eyes, which essentially boosts performance. Meanwhile, the one-step design also ensures high running efficiency. Experiments on the challenging Gaze360 dataset verify the superiority of our proposition. The source code will be released at this https URL.
https://arxiv.org/abs/2310.18131
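The fusion of head, face, and eye queries can be sketched as a small decoding head over their concatenation; the dimensions and linear decoder below are illustrative, since the paper's decoder is Transformer-based.

```python
import torch
import torch.nn as nn

class GazeFusionHead(nn.Module):
    """Decode gaze from fused head-, face- and eye-query features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(3 * dim, 2)                 # (pitch, yaw)

    def forward(self, head_q, face_q, eye_q):           # each: (B, dim)
        return self.fc(torch.cat([head_q, face_q, eye_q], dim=-1))
```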
Intelligent edge vision tasks face the critical challenge of ensuring power and latency efficiency due to the typically heavy computational load they impose on edge platforms. This work leverages one of the first "AI in sensor" vision platforms, the IMX500 by Sony, to achieve ultra-fast and ultra-low-power end-to-end edge vision applications. We evaluate the IMX500 and compare it to other edge platforms, such as the Google Coral Dev Micro and Sony Spresense, using gaze estimation as a case study. We propose TinyTracker, a highly efficient, fully quantized model for 2D gaze estimation designed to maximize the performance of the edge vision systems considered in this study. TinyTracker achieves a 41x size reduction (600 kB) compared to iTracker [1] without significant loss in gaze estimation accuracy (a maximum of 0.16 cm when fully quantized). TinyTracker's deployment on the Sony IMX500 vision sensor results in an end-to-end latency of around 19 ms. The camera takes around 17.9 ms to read, process, and transmit the pixels to the accelerator. The inference time of the network is 0.86 ms, with an additional 0.24 ms for retrieving the results from the sensor. The overall energy consumption of the end-to-end system is 4.9 mJ, including 0.06 mJ for inference. The end-to-end study shows that the IMX500 is 1.7x faster than the Coral Dev Micro (19 ms vs. 34.4 ms) and 7x more power efficient (4.9 mJ vs. 34.2 mJ).
https://arxiv.org/abs/2307.07813
Purpose: Metrics derived from eye-gaze tracking and pupillometry show promise for cognitive load assessment, potentially enhancing training and patient safety through user-specific feedback in tele-robotic surgery. However, the effectiveness of current eye-tracking solutions in tele-robotic surgery is uncertain compared to everyday situations, because close-range interactions cause extreme pupil angles and occlusions. To assess the effectiveness of modern eye-gaze-tracking solutions in tele-robotic surgery, we compare the Tobii Pro 3 Glasses and the Pupil Labs Core, evaluating their pupil diameter and gaze stability when integrated with the da Vinci Research Kit (dVRK). Methods: The study protocol includes a nine-point gaze calibration followed by a pick-and-place task using the dVRK, and is repeated three times. After a final calibration, users view a 3x3 grid of AprilTags, focusing on each marker for 10 seconds, to evaluate gaze stability across dVRK-screen positions with the L2 norm. Different gaze calibrations assess the calibration's temporal deterioration due to head movements. Pupil diameter stability is evaluated using the FFT of the pupil diameter signal during the pick-and-place tasks. Users perform this routine with both head-worn eye-tracking systems. Results: Data collected from ten users indicate comparable pupil diameter stability. FFTs of the pupil diameters show similar amplitudes in the high-frequency components. The Tobii Glasses show more temporal gaze stability than the Pupil Labs Core, though both eye trackers yield a similar 4 cm error in gaze estimation when the calibration is not outdated. Conclusion: Both eye trackers demonstrate similar stability of pupil diameter and gaze when the calibration is not outdated, indicating comparable eye-tracking and pupillometry performance in tele-robotic surgery settings.
https://arxiv.org/abs/2310.13720
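The two analyses map onto short computations: per-marker gaze stability as the mean L2 distance from the marker, and pupil-diameter dynamics via the FFT. A sketch under assumed array layouts:

```python
import numpy as np

def gaze_stability(gaze_xy: np.ndarray, marker_xy: np.ndarray) -> float:
    """gaze_xy: (N, 2) on-screen gaze samples; marker_xy: (2,) AprilTag position."""
    return float(np.linalg.norm(gaze_xy - marker_xy, axis=1).mean())

def pupil_spectrum(diameter: np.ndarray, fs: float):
    """Amplitude spectrum of the pupil-diameter signal sampled at fs Hz."""
    amp = np.abs(np.fft.rfft(diameter - diameter.mean()))
    freqs = np.fft.rfftfreq(diameter.size, d=1.0 / fs)
    return freqs, amp
```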