Gaze estimation methods suffer significant performance degradation when evaluated across domains, because of the gap between the training and testing data. Existing methods try to solve this issue by reducing the deviation of the data distribution; however, they ignore the label deviation present in the data, which stems from the gaze-label acquisition mechanism and individual physiological differences. In this paper, we first show that the influence of this label deviation cannot be ignored, and we propose a gaze label alignment algorithm (GLA) to eliminate the label distribution deviation. Specifically, we first train a feature extractor on all domains to obtain domain-invariant features, and then select an anchor domain to train the gaze regressor. We predict gaze labels on the remaining domains and use a mapping function to align the labels. Finally, these aligned labels can be used to train gaze estimation models. Therefore, our method can be combined with any existing method. Experimental results show that our GLA method effectively alleviates the label distribution shift and noticeably improves state-of-the-art gaze estimation methods.
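The mapping step can be made concrete with a minimal sketch, assuming an affine mapping function (the abstract does not fix its form): after the anchor regressor predicts gaze on a remaining domain, a least-squares fit maps that domain's original labels into the anchor's label space.

```python
import numpy as np

def align_labels(pred_anchor, labels_domain):
    """Fit an affine map so that mapped labels match the anchor regressor's predictions,
    then return the aligned labels for the remaining domain.

    pred_anchor:   (N, 2) gaze angles predicted by the anchor-domain regressor
    labels_domain: (N, 2) original gaze labels of the remaining domain
    """
    X = np.hstack([labels_domain, np.ones((len(labels_domain), 1))])  # add bias column
    W, *_ = np.linalg.lstsq(X, pred_anchor, rcond=None)               # (3, 2) affine map
    aligned = X @ W                                                   # labels in anchor label space
    return aligned, W

# toy usage: pretend the anchor regressor sees a scale/offset mismatch in the labels
rng = np.random.default_rng(0)
labels = rng.uniform(-0.5, 0.5, size=(100, 2))
preds = 0.9 * labels + 0.05
aligned, W = align_labels(preds, labels)
print(np.abs(aligned - preds).mean())   # ~0: labels now live in the anchor label space
```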
https://arxiv.org/abs/2412.15601
Uncertainty in gaze estimation manifests in two aspects: 1) low-quality images caused by occlusion, blurriness, inconsistent eye movements, or even non-face images; 2) incorrect labels resulting from misalignment between the labeled and actual gaze points during annotation. Allowing these uncertainties to participate in training hinders the improvement of gaze estimation. To tackle these challenges, in this paper we propose an effective solution, named Suppressing Uncertainty in Gaze Estimation (SUGE), which introduces a novel triplet-label consistency measurement to estimate and reduce the uncertainties. Specifically, for each training sample, we estimate a novel ``neighboring label'' calculated by a linearly weighted projection from its neighbors to capture the similarity relationship between image features and their corresponding labels; this neighboring label can be combined with the predicted pseudo label and the ground-truth label for uncertainty estimation. By modeling such triplet-label consistency, we can measure the quality of both images and labels, and largely reduce the negative effects of unqualified images and wrong labels through our designed sample weighting and label correction strategies. Experimental results on gaze estimation benchmarks indicate that our proposed SUGE achieves state-of-the-art performance.
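A minimal sketch of the "neighboring label" computation, with an assumed weighting scheme (softmax over negative feature distances, fixed k) since the abstract does not specify it; the distances among neighboring, pseudo, and ground-truth labels then give a per-sample inconsistency score.

```python
import numpy as np

def neighboring_labels(features, labels, k=5, tau=0.1):
    """Per-sample label from a linearly weighted projection of its k feature-space neighbors."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)  # (N, N)
    np.fill_diagonal(d, np.inf)                                               # exclude the sample itself
    idx = np.argsort(d, axis=1)[:, :k]                                        # k nearest neighbors
    w = np.exp(-np.take_along_axis(d, idx, 1) / tau)
    w /= w.sum(axis=1, keepdims=True)                                         # similarity weights
    return (w[..., None] * labels[idx]).sum(axis=1)                           # (N, 2)

def triplet_inconsistency(neigh, pseudo, gt):
    """Larger value -> more likely a low-quality image or a wrong label."""
    return (np.linalg.norm(neigh - gt, axis=1)
            + np.linalg.norm(pseudo - gt, axis=1)
            + np.linalg.norm(neigh - pseudo, axis=1))
```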
https://arxiv.org/abs/2412.12890
In recent years, the accuracy of gaze estimation techniques has gradually improved, but existing methods often rely on large datasets or large models to improve performance, which leads to high demands on computational resources. To address this issue, this paper proposes EM-Net, a lightweight gaze estimation model based on deep learning and the traditional Expectation-Maximization (EM) algorithm. First, the proposed Global Attention Mechanism (GAM) is added to extract gaze-related features and improve the model's ability to capture global dependencies, thereby improving its performance. Second, by learning hierarchical feature representations through the EM module, the model gains strong generalization ability, which reduces the required sample size. Experiments confirm that, using only 50% of the training data, EM-Net improves performance on the Gaze360, MPIIFaceGaze, and RT-Gene datasets by 2.2%, 2.02%, and 2.03%, respectively, compared with GazeNAS-ETH. It also shows good robustness to Gaussian noise.
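The abstract does not detail how the EM module operates; as a hedged illustration, the sketch below runs a classic EM-style alternation over feature vectors (soft assignments to learned bases, then basis re-estimation), which is one common way an EM step is embedded in a network.

```python
import numpy as np

def em_iterations(feats, n_bases=8, iters=3, tau=1.0, seed=0):
    """EM-style alternation: the E-step computes soft assignments of features to bases,
    the M-step re-estimates the bases as responsibility-weighted means.

    feats: (N, C) feature vectors; returns (n_bases, C) bases and (N, n_bases) responsibilities.
    """
    rng = np.random.default_rng(seed)
    mu = feats[rng.choice(len(feats), n_bases, replace=False)]        # initialize bases from samples
    for _ in range(iters):
        logits = feats @ mu.T / tau                                   # (N, K) similarity
        r = np.exp(logits - logits.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)                             # E-step: responsibilities
        mu = (r.T @ feats) / (r.sum(axis=0)[:, None] + 1e-6)          # M-step: update bases
    return mu, r
```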
https://arxiv.org/abs/2412.08074
Using lightweight models as backbone networks in gaze estimation tasks often results in significant performance degradation. The main reason is that the number of feature channels in lightweight networks is usually small, which limits the model's expressive capacity. To improve the performance of lightweight models on gaze estimation tasks, a network model named Multitask-Gaze is proposed. The main components of Multitask-Gaze are Unidirectional Convolution (UC), Spatial and Channel Attention (SCA), a Global Convolution Module (GCM), and a Multi-task Regression Module (MRM). UC not only significantly reduces the number of parameters and FLOPs, but also extends the receptive field and improves the long-distance modeling capability of the model, thereby improving performance. SCA highlights gaze-related features and suppresses gaze-irrelevant ones. The GCM replaces the pooling layer and avoids the performance degradation caused by information loss. MRM improves the accuracy of individual tasks and strengthens the connections between tasks for overall performance improvement. Experimental results show that, compared with the state-of-the-art method SUGE, the performance of Multitask-Gaze on the MPIIFaceGaze and Gaze360 datasets improves by 1.71% and 2.75%, respectively, while the number of parameters and FLOPs are significantly reduced by 75.5% and 86.88%.
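"Unidirectional Convolution" is not defined further in the abstract; one plausible reading (an assumption here) is a pair of strip convolutions, 1xk followed by kx1, which cuts parameters and FLOPs relative to a full kxk kernel while extending the receptive field along each axis.

```python
import torch
import torch.nn as nn

class UCBlock(nn.Module):
    """Hypothetical unidirectional-convolution block: a 1xk conv followed by a kx1 conv.
    Parameter count is roughly 2*k*C*C versus k*k*C*C for a full kxk convolution."""

    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.vertical(self.horizontal(x))) + x  # residual keeps the block easy to train

x = torch.randn(1, 32, 56, 56)
print(UCBlock(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```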
https://arxiv.org/abs/2411.18061
Gaze estimation encounters generalization challenges when dealing with out-of-distribution data. To address this problem, recent methods use neural radiance fields (NeRF) to generate augmented data. However, existing methods based on NeRF are computationally expensive and lack facial details. 3D Gaussian Splatting (3DGS) has become the prevailing representation of neural fields. While 3DGS has been extensively examined in head avatars, it faces challenges with accurate gaze control and generalization across different subjects. In this work, we propose GazeGaussian, a high-fidelity gaze redirection method that uses a two-stream 3DGS model to represent the face and eye regions separately. By leveraging the unstructured nature of 3DGS, we develop a novel eye representation for rigid eye rotation based on the target gaze direction. To enhance synthesis generalization across various subjects, we integrate an expression-conditional module to guide the neural renderer. Comprehensive experiments show that GazeGaussian outperforms existing methods in rendering speed, gaze redirection accuracy, and facial synthesis across multiple datasets. We also demonstrate that existing gaze estimation methods can leverage GazeGaussian to improve their generalization performance. The code will be available at: this https URL.
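A minimal sketch of the rigid eye-rotation idea: the eye-stream Gaussian centers are rotated about the eyeball center according to the target gaze (pitch, yaw). The axis convention below is an assumption, and in a full 3DGS pipeline the per-Gaussian orientations/covariances would be rotated by the same matrix.

```python
import numpy as np

def rotate_eye_gaussians(centers, eye_center, pitch, yaw):
    """Rigidly rotate 3D Gaussian centers of the eye region to match a target gaze direction.

    centers: (N, 3) Gaussian means; eye_center: (3,) eyeball center; angles in radians.
    """
    cp, sp, cy, sy = np.cos(pitch), np.sin(pitch), np.cos(yaw), np.sin(yaw)
    R_pitch = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # rotation about the x-axis
    R_yaw = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])     # rotation about the y-axis
    R = R_pitch @ R_yaw
    return (centers - eye_center) @ R.T + eye_center             # rotate about the eyeball center
```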
https://arxiv.org/abs/2411.12981
The ability of gaze estimation models to generalize is often significantly hindered by various factors unrelated to gaze, especially when the training dataset is limited. Current strategies aim to address this challenge through different domain generalization techniques, yet they have had limited success due to the risk of overfitting when relying solely on value labels for regression. Recent progress in pre-trained vision-language models has motivated us to capitalize on the abundant semantic information they provide. In this paper we propose a novel approach, reframing the gaze estimation task as a vision-language alignment problem. Our proposed framework, named Language-Guided Gaze Estimation (LG-Gaze), learns continuous, geometry-sensitive features for gaze estimation, benefiting from the rich prior knowledge of vision-language models. Specifically, LG-Gaze aligns gaze features with continuous linguistic features through our proposed multimodal contrastive regression loss, which customizes adaptive weights for different negative samples. Furthermore, to better adapt to the labels of the gaze estimation task, we propose a geometry-aware interpolation method to obtain more precise gaze embeddings. Through extensive experiments, we validate the efficacy of our framework on four different cross-domain evaluation tasks.
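A hedged sketch of a multimodal contrastive regression loss whose negatives are re-weighted by label distance; the exact weighting LG-Gaze uses is not given in the abstract, so the label-distance-based weights below are an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_regression_loss(img_feat, txt_feat, gaze, tau=0.07):
    """img_feat, txt_feat: (B, D) embeddings to be aligned; gaze: (B, 2) pitch/yaw labels.
    Negatives whose labels are close to the anchor's are down-weighted in the denominator,
    so samples with similar gaze are not pushed apart as strongly."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    sim = img @ txt.T / tau                                 # (B, B) image-text similarities
    label_dist = torch.cdist(gaze, gaze)                    # (B, B) distance between gaze labels
    w = label_dist / (label_dist.max() + 1e-6)              # adaptive weights in [0, 1]
    w.fill_diagonal_(1.0)                                   # leave the positive term untouched
    logits = sim + torch.log(w + 1e-6)                      # weighted softmax denominator
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(logits, targets)
```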
https://arxiv.org/abs/2411.08606
We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled image adding/deleting, repositioning, and surface material changes of image objects, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters, performing accurate real-time gaze predictions tailored to individual users' eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze. This real-time gaze estimation enables various visual content generation tasks, all controlled by the user's gaze. The input for DFT Gaze is the user's eye images, while the inputs for visual content generation are the user's view and the predicted gaze point from DFT Gaze. To achieve efficient gaze predictions, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder, developing a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on the edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.
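A hedged sketch of the distillation objective described above: the compact student reconstructs masked eye-image patches while matching the frozen teacher's features; the loss weighting and the feature layer chosen are assumptions. Personal adaptation with Adapters would then fine-tune a small set of extra parameters per user on top of this distilled backbone.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_recon, target_patches, student_feat, teacher_feat, lam=1.0):
    """Combine masked-autoencoder reconstruction with feature-level distillation.

    student_recon / target_patches: (B, N_masked, P) reconstructed vs. ground-truth masked patches
    student_feat / teacher_feat:    (B, D) pooled features (teacher is frozen and ~10x larger)
    """
    recon = F.mse_loss(student_recon, target_patches)            # MAE-style reconstruction
    distill = F.mse_loss(student_feat, teacher_feat.detach())    # match the teacher representation
    return recon + lam * distill
```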
https://arxiv.org/abs/2411.04335
Gaze is a crucial social cue in any interaction scenario and drives many mechanisms of social cognition (joint and shared attention, predicting human intention, coordination tasks). Gaze direction is an indicator of social and emotional functions and affects the way emotions are perceived. Evidence shows that embodied humanoid robots endowed with social abilities can serve as sophisticated stimuli to unravel many mechanisms of human social cognition while increasing engagement and ecological validity. In this context, building a robotic perception system that automatically estimates human gaze relying only on the robot's sensors remains demanding. The main goal of this paper is to propose a learning robotic architecture that estimates the human gaze direction in table-top scenarios without any external hardware. Table-top tasks are widely used in experimental psychology because they are suitable for implementing numerous scenarios that allow agents to collaborate while maintaining face-to-face interaction. Such an architecture can provide valuable support in studies where external hardware might represent an obstacle to spontaneous human behaviour, especially in environments less controlled than the laboratory (e.g., clinical settings). A novel dataset was also collected with the humanoid robot iCub, including annotated images of 24 participants in different gaze conditions.
https://arxiv.org/abs/2410.19374
We explore the transformative potential of SAM 2, a vision foundation model, in advancing gaze estimation and eye tracking technologies. By significantly reducing annotation time, lowering technical barriers through its ease of deployment, and enhancing segmentation accuracy, SAM 2 addresses critical challenges faced by researchers and practitioners. Utilizing its zero-shot segmentation capabilities with minimal user input (a single click per video), we tested SAM 2 on over 14 million eye images from diverse datasets, including virtual reality setups and the world's largest unified dataset recorded using wearable eye trackers. Remarkably, in pupil segmentation tasks, SAM 2 matches the performance of domain-specific models trained solely on eye images, achieving competitive mean Intersection over Union (mIoU) scores of up to 93% without fine-tuning. Additionally, we provide our code and segmentation masks for these widely used datasets to promote further research.
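For reference, the mIoU figure quoted above is the standard per-class metric; a straightforward implementation (not code from the paper) is:

```python
import numpy as np

def mean_iou(pred, gt, classes=(0, 1, 2, 3)):
    """Mean Intersection over Union across classes (e.g., background, sclera, iris, pupil).
    pred, gt: integer label maps of identical shape."""
    ious = []
    for c in classes:
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                                   # class absent from both masks
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```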
https://arxiv.org/abs/2410.08926
Gaze Target Detection (GTD), i.e., determining where a person is looking within a scene from an external viewpoint, is a challenging task, particularly in 3D space. Existing approaches heavily rely on analyzing the person's appearance, primarily focusing on their face to predict the gaze target. This paper presents a novel approach to tackle this problem by utilizing the person's upper-body pose and available depth maps to extract a 3D gaze direction and employing a multi-stage or an end-to-end pipeline to predict the gazed target. When predicted accurately, the human body pose can provide valuable information about the head pose, which is a good approximation of the gaze direction, as well as the position of the arms and hands, which are linked to the activity the person is performing and the objects they are likely focusing on. Consequently, in addition to performing gaze estimation in 3D, we are also able to perform GTD simultaneously. We demonstrate state-of-the-art results on the most comprehensive publicly accessible 3D gaze target detection dataset without requiring images of the person's face, thus promoting privacy preservation in various application contexts. The code is available at this https URL.
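A generic geometric step that connects a 3D gaze direction to a gazed target, assumed here for illustration rather than taken from the paper's pipeline: cast a ray from the estimated head/eye position along the gaze direction and pick the scene point (e.g., the depth map back-projected to 3D) closest to that ray.

```python
import numpy as np

def gaze_target_from_ray(origin, direction, scene_points):
    """origin: (3,) head/eye position; direction: (3,) gaze vector;
    scene_points: (N, 3) 3D points back-projected from the depth map.
    Returns the scene point with the smallest perpendicular distance to the gaze ray."""
    d = direction / np.linalg.norm(direction)
    v = scene_points - origin                      # (N, 3) offsets from the ray origin
    t = np.clip(v @ d, 0, None)                    # projection length; ignore points behind the observer
    closest_on_ray = origin + t[:, None] * d
    dist = np.linalg.norm(scene_points - closest_on_ray, axis=1)
    return scene_points[np.argmin(dist)]
```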
https://arxiv.org/abs/2409.17886
Current video-based computer vision (CV) applications typically suffer from high energy consumption due to reading and processing all pixels in a frame, regardless of their significance. While previous works have attempted to reduce this energy by skipping input patches or pixels and using feedback from the end task to guide the skipping algorithm, the skipping is not performed during the sensor read phase. As a result, these methods cannot optimize the front-end sensor energy. Moreover, they may not be suitable for real-time applications due to the long latency of modern CV networks deployed in the back-end. To address this challenge, this paper presents a custom-designed reconfigurable CMOS image sensor (CIS) system that improves energy efficiency by selectively skipping uneventful regions or rows within a frame during the sensor's readout phase and the subsequent analog-to-digital conversion (ADC) phase. A novel masking algorithm intelligently directs the skipping process in real time, optimizing both the front-end sensor and back-end neural networks for applications including autonomous driving and augmented/virtual reality (AR/VR). Our system can also operate in standard mode without skipping, depending on application needs. We evaluate our hardware-algorithm co-design framework on object detection based on BDD100K and ImageNetVID, and on gaze estimation based on OpenEDS, achieving up to a 53% reduction in front-end sensor energy while maintaining state-of-the-art (SOTA) accuracy.
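A software analogue of the row-skipping decision, assuming a simple inter-frame-difference criterion (the paper's masking algorithm is co-designed with the back-end network, so this is only an illustration): rows whose change since the last read falls below a threshold are neither read out nor digitized.

```python
import numpy as np

def rows_to_read(prev_frame, current_estimate, thresh=8.0):
    """Return a boolean mask over rows: True = read out and digitize this row.

    prev_frame, current_estimate: (H, W) uint8 intensity images; the estimate could come
    from a cheap preview read or simply the previous frame."""
    row_change = np.abs(current_estimate.astype(np.int16)
                        - prev_frame.astype(np.int16)).mean(axis=1)   # mean change per row
    return row_change > thresh                                        # uneventful rows are skipped

# rows marked False would be filled with the previous frame's values downstream
```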
https://arxiv.org/abs/2409.17341
The advent and growing popularity of Virtual Reality (VR) and Mixed Reality (MR) solutions have revolutionized the way we interact with digital platforms. The cutting-edge gaze-controlled typing methods now prevalent in high-end models of these devices, e.g., the Apple Vision Pro, have not only improved user experience but also mitigated traditional keystroke inference attacks that relied on hand gestures, head movements, and acoustic side-channels. However, this advancement has paradoxically given birth to a new, potentially more insidious cyber threat, GAZEploit. In this paper, we unveil GAZEploit, a novel eye-tracking-based attack specifically designed to exploit this eye-tracking information by leveraging the common use of virtual appearances in VR applications. This widespread usage significantly enhances the practicality and feasibility of our attack compared to existing methods. GAZEploit takes advantage of this vulnerability to remotely extract gaze estimations and steal sensitive keystroke information across various typing scenarios, including messages, passwords, URLs, emails, and passcodes. Our research, involving 30 participants, achieved over 80% accuracy in keystroke inference. Alarmingly, our study also identified over 15 top-rated apps in the Apple Store as vulnerable to the GAZEploit attack, emphasizing the urgent need for bolstered security measures for this state-of-the-art VR/MR text entry method.
https://arxiv.org/abs/2409.08122
Achieving accurate and reliable gaze predictions in complex and diverse environments remains challenging. Fortunately, it is straightforward to access diverse gaze datasets in real-world applications. We discover that training on these datasets jointly can significantly improve the generalization of gaze estimation, which has been overlooked in previous work. However, due to the inherent distribution shift across different datasets, simply mixing multiple datasets decreases performance in the original domain despite gaining better generalization ability. To address the problem of ``cross-dataset gaze estimation'', we propose a novel Evidential Inter-intra Fusion (EIF) framework for training a cross-dataset model that performs well across all source and unseen domains. Specifically, we build independent single-dataset branches for the various datasets, where the data space is partitioned into overlapping subspaces within each dataset for local regression, and we further create a cross-dataset branch to integrate the generalizable features from the single-dataset branches. Furthermore, evidential regressors based on the Normal Inverse-Gamma (NIG) distribution are designed to provide uncertainty estimation in addition to predicting gaze. Building upon this foundation, our proposed framework achieves both intra-evidential fusion among multiple local regressors within each dataset and inter-evidential fusion among multiple branches via a Mixture of Normal Inverse-Gamma (MoNIG) distribution. Experiments demonstrate that our method consistently achieves notable improvements in both source domains and unseen domains.
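For context, an evidential regressor with NIG output (gamma, nu, alpha, beta) yields closed-form uncertainty alongside its prediction; the decomposition below is the standard deep-evidential-regression form and stands in for the per-branch regressors before MoNIG fusion.

```python
def nig_prediction(gamma, nu, alpha, beta):
    """Prediction plus aleatoric/epistemic uncertainty from Normal Inverse-Gamma parameters.
    Requires alpha > 1 for the expectations to exist."""
    prediction = gamma                       # E[mu]
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: noise inherent to the data
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: uncertainty about the model's estimate
    return prediction, aleatoric, epistemic
```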
https://arxiv.org/abs/2409.04766
Multiple datasets have been created for training and testing appearance-based gaze estimators. Intuitively, more data should lead to better performance. However, combining datasets to train a single estimator rarely improves gaze estimation performance. One reason may be differences in the experimental protocols used to obtain the gaze samples, resulting in differences in the distributions of head poses, gaze angles, illumination, etc. Another reason may be the inconsistency between methods used to define gaze angles (label mismatch). We propose two innovations to improve the performance of gaze estimation by leveraging multiple datasets: a change in the estimator architecture and the introduction of a gaze adaptation module. Most state-of-the-art estimators merge information extracted from images of the two eyes and the entire face either in parallel or by combining information from the eyes first and then with the face. Our proposed Two-stage Transformer-based Gaze-feature Fusion (TTGF) method uses transformers to merge information from each eye and the face separately and then merge across the two eyes. We argue that this improves head pose invariance, since changes in head pose affect left and right eye images in different ways. Our proposed Gaze Adaptation Module (GAM) method handles annotation inconsistency by applying a Gaze Adaptation Module to each dataset to correct gaze estimates from a single shared estimator. This enables us to combine information across datasets despite differences in labeling. Our experiments show that these innovations improve gaze estimation performance over the SOTA both individually and collectively (by 10%-20%). Our code is available at this https URL.
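A hedged sketch of the Gaze Adaptation Module idea: a small learned correction, one instance per training dataset, applied to the shared estimator's output so that label-definition mismatches are absorbed outside the shared backbone. The affine form below is an assumption.

```python
import torch
import torch.nn as nn

class GazeAdaptationModule(nn.Module):
    """One per dataset: maps the shared estimator's (pitch, yaw) into that dataset's label convention."""

    def __init__(self):
        super().__init__()
        self.correction = nn.Linear(2, 2)
        nn.init.eye_(self.correction.weight)   # start as the identity mapping
        nn.init.zeros_(self.correction.bias)

    def forward(self, gaze: torch.Tensor) -> torch.Tensor:
        return self.correction(gaze)

shared_estimate = torch.randn(8, 2)                                   # output of the single shared estimator
gams = nn.ModuleList([GazeAdaptationModule() for _ in range(3)])      # e.g., three training datasets
per_dataset = [gam(shared_estimate) for gam in gams]                  # dataset-specific corrected gaze
```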
https://arxiv.org/abs/2409.00912
Parsing of eye components (i.e., pupil, iris, and sclera) is fundamental for eye tracking and gaze estimation in AR/VR products. Mainstream approaches tackle this problem as a multi-class segmentation task, providing only the visible part of the pupil/iris, while other methods regress elliptical parameters using human-annotated full pupil/iris parameters. In this paper, we consider two priors: the projected full pupil/iris circle can be modelled with an ellipse (ellipse prior), and the visibility of the pupil/iris is controlled by the openness of the eye region (condition prior). We design a novel method, CondSeg, that estimates elliptical parameters of the pupil/iris directly from segmentation labels, without explicitly annotating full ellipses, and uses an eye-region mask to control the visibility of the estimated pupil/iris ellipses. A conditioned segmentation loss optimizes the parameters by transforming the parameterized ellipses into pixel-wise soft masks in a differentiable way. Our method is tested on public datasets (OpenEDS-2019/-2020), shows competitive results on segmentation metrics, and simultaneously provides accurate elliptical parameters for downstream eye-tracking applications.
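The differentiable ellipse-to-mask step lends itself to a short sketch: evaluate the implicit ellipse function on the pixel grid, squash it with a sigmoid to obtain a soft mask, and gate the result with the eye-region mask so that only the visible part is compared against the segmentation label. The temperature and gating choice are assumptions.

```python
import torch

def ellipse_soft_mask(cx, cy, a, b, theta, h, w, eye_mask, tau=2.0):
    """Render a parameterized ellipse as a differentiable soft mask, limited to the visible eye region.

    (cx, cy): center, (a, b): semi-axes, theta: rotation (all 0-d tensors, pixels/radians);
    eye_mask: (h, w) tensor in [0, 1] encoding eye-region openness."""
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    x, y = xs - cx, ys - cy
    xr = x * torch.cos(theta) + y * torch.sin(theta)       # rotate into the ellipse frame
    yr = -x * torch.sin(theta) + y * torch.cos(theta)
    level = 1.0 - ((xr / a) ** 2 + (yr / b) ** 2)          # >0 inside, <0 outside
    soft = torch.sigmoid(tau * level)                       # differentiable w.r.t. the ellipse parameters
    return soft * eye_mask                                  # condition prior: keep only the visible part

mask = ellipse_soft_mask(torch.tensor(32.0), torch.tensor(24.0), torch.tensor(10.0),
                         torch.tensor(6.0), torch.tensor(0.3), 48, 64, torch.ones(48, 64))
```

The resulting soft mask can then be supervised against the visible-part segmentation label (e.g., with a pixel-wise binary cross-entropy), so gradients flow back to the ellipse parameters.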
https://arxiv.org/abs/2408.17231
The availability of extensive datasets containing gaze information for each subject has significantly enhanced gaze estimation accuracy. However, the discrepancy between domains severely affects the performance of a model explicitly trained for a particular domain. In this paper, we propose the Causal Representation-Based Domain Generalization on Gaze Estimation (CauGE) framework, designed around the general principle of causal mechanisms, which remains consistent across domain differences. We employ adversarial training and an additional penalty term to extract domain-invariant features. After extracting features, we apply an attention layer to make the features sufficient for inferring the actual gaze. By leveraging these modules, CauGE ensures that the neural network learns from representations that meet the general principles of causal mechanisms. In this way, CauGE generalizes across domains by extracting domain-invariant features that spurious correlations cannot influence. Our method achieves state-of-the-art performance on the gaze estimation domain generalization benchmark.
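Adversarial extraction of domain-invariant features is commonly implemented with a gradient reversal layer in front of a domain classifier; the sketch below shows that generic mechanism, as an assumption about CauGE's concrete implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def adversarial_domain_loss(features, domain_labels, domain_classifier, lam=1.0):
    """Minimizing this loss trains the classifier to predict the domain, while the reversed
    gradient pushes the feature extractor toward domain-invariant representations."""
    reversed_feat = GradReverse.apply(features, lam)
    logits = domain_classifier(reversed_feat)
    return nn.functional.cross_entropy(logits, domain_labels)
```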
https://arxiv.org/abs/2408.16964
Gaze estimation is pivotal in human scene comprehension tasks, particularly in medical diagnostic analysis. Eye-tracking technology facilitates the recording of physicians' ocular movements during image interpretation, thereby elucidating their visual attention patterns and information-processing strategies. In this paper, we initially define the context-aware gaze estimation problem in medical radiology report settings. To understand the attention allocation and cognitive behavior of radiologists during the medical image interpretation process, we propose a context-aware Gaze EstiMation (GEM) network that utilizes eye gaze data collected from radiologists to simulate their visual search behavior patterns throughout the image interpretation process. It consists of a context-awareness module, visual behavior graph construction, and visual behavior matching. Within the context-awareness module, we achieve intricate multimodal registration by establishing connections between medical reports and images. Subsequently, for a more accurate simulation of genuine visual search behavior patterns, we introduce a visual behavior graph structure, capturing such behavior through high-order relationships (edges) between gaze points (nodes). To maintain the authenticity of visual behavior, we devise a visual behavior-matching approach, adjusting the high-order relationships between them by matching the graph constructed from real and estimated gaze points. Extensive experiments on four publicly available datasets demonstrate the superiority of GEM over existing methods and its strong generalizability, which also provides a new direction for the effective utilization of diverse modalities in medical image interpretation and enhances the interpretability of models in the field of medical imaging. this https URL
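A minimal sketch of constructing a visual behavior graph from recorded gaze: gaze points become nodes and edges connect points that are close in time or space. Pairwise edges with assumed thresholds are shown here; the paper's high-order relationships go beyond this.

```python
import numpy as np

def build_gaze_graph(points, times, spatial_thresh=0.1, temporal_window=3):
    """points: (N, 2) normalized gaze coordinates; times: (N,) fixation order or timestamps.
    Returns an (N, N) adjacency matrix with edges between temporally or spatially close gaze points."""
    n = len(points)
    adj = np.zeros((n, n), dtype=np.float32)
    spatial = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    temporal = np.abs(times[:, None] - times[None, :])
    adj[(spatial < spatial_thresh) | (temporal <= temporal_window)] = 1.0
    np.fill_diagonal(adj, 0.0)                     # no self-loops
    return adj
```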
https://arxiv.org/abs/2408.05502
Eye gaze contains rich information about human attention and cognitive processes. This capability makes the underlying technology, known as gaze tracking, a critical enabler for many ubiquitous applications and has triggered the development of easy-to-use gaze estimation services. Indeed, by utilizing the ubiquitous cameras on tablets and smartphones, users can readily access many gaze estimation services. In using these services, users must provide their full-face images to the gaze estimator, which is often a black box. This poses significant privacy threats to the users, especially when a malicious service provider gathers a large collection of face images to classify sensitive user attributes. In this work, we present PrivateGaze, the first approach that can effectively preserve users' privacy in black-box gaze tracking services without compromising gaze estimation performance. Specifically, we propose a novel framework to train a privacy preserver that converts full-face images into obfuscated counterparts, which are effective for gaze estimation while containing no privacy information. Evaluation on four datasets shows that the obfuscated images protect users' private information, such as identity and gender, against unauthorized attribute classification. Meanwhile, when used directly by the black-box gaze estimator as inputs, the obfuscated images lead to tracking performance comparable to that of the conventional, unprotected full-face images.
https://arxiv.org/abs/2408.00950
Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the intensive human annotation requirement inhibits current methods from achieving industrial-level accuracy and robustness. Although current unsupervised pre-training frameworks have achieved success in many image recognition tasks, due to the deep coupling between facial and eye features, such frameworks are still deficient in extracting useful gaze features from full-face images. To alleviate the above limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework, which forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations, through collaborative feature-contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection-bottleneck design; this encourages the model to pay more attention to gaze direction rather than facial textures alone, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related contrastive loss is designed to further boost the learned representation by forcing the model to focus on eye-centered regions. Extensive experimental results on several gaze benchmarks demonstrate that the proposed scheme achieves superior performance over the unsupervised state of the art.
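A hedged sketch of the alternating eye-attended/unattended masking: in one phase masking avoids the eye patches, in the other it targets them, so the eye-masked auto-encoder has to pull gaze-related information out of the full-face branch. The patch layout and masking ratio below are assumptions.

```python
import numpy as np

def alternating_eye_mask(grid_h, grid_w, eye_boxes, step, mask_ratio=0.6, seed=0):
    """Return a boolean (grid_h, grid_w) patch mask (True = masked).

    eye_boxes: list of (r0, r1, c0, c1) patch-coordinate boxes covering the eyes.
    Even steps mask only non-eye patches (eye-attended); odd steps mask the eye patches."""
    rng = np.random.default_rng(seed + step)
    eye = np.zeros((grid_h, grid_w), dtype=bool)
    for r0, r1, c0, c1 in eye_boxes:
        eye[r0:r1, c0:c1] = True
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    if step % 2 == 0:                                   # eye-attended: keep the eyes visible
        candidates = np.flatnonzero(~eye.ravel())
        chosen = rng.choice(candidates, int(mask_ratio * candidates.size), replace=False)
        mask.ravel()[chosen] = True
    else:                                               # eye-unattended: hide the eye regions
        mask |= eye
    return mask
```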
https://arxiv.org/abs/2407.00315
This study presents a novel framework for 3D gaze tracking tailored to mixed-reality settings, aimed at enhancing joint attention and collaborative efforts in team-based scenarios. Conventional gaze tracking, often limited by monocular cameras and traditional eye-tracking apparatus, struggles with simultaneous data synchronization and analysis from multiple participants in group contexts. Our proposed framework leverages state-of-the-art computer vision and machine learning techniques to overcome these obstacles, enabling precise 3D gaze estimation without dependence on specialized hardware or complex data fusion. Utilizing facial recognition and deep learning, the framework achieves real-time tracking of gaze patterns across several individuals, addresses common depth-estimation errors, and ensures spatial and identity consistency within the dataset. Empirical results demonstrate the accuracy and reliability of our method in group environments. This provides mechanisms for significant advances in behavior and interaction analysis in educational and professional training applications in dynamic and unstructured environments.
https://arxiv.org/abs/2406.11003