Gaze estimation, which predicts gaze direction, commonly faces the challenge of interference from complex gaze-irrelevant information in face images. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits information from facial images in three aspects: gaze-relevant global features (disentangled from the facial image), local eye features (extracted from cropped eye patches), and head pose estimation features, to improve overall performance. First, we design a new continuous-mask-based Disentangler that accurately separates gaze-relevant from gaze-irrelevant information in facial images by pursuing a dual-branch disentanglement objective: the two branches separately reconstruct the eye and non-eye regions. Furthermore, we introduce a new cascaded attention module named the Multi-Scale Global Local Attention Module (MS-GLAM). Through a customized cascaded attention structure, it effectively attends to global and local information at multiple scales, further enhancing the information from the Disentangler. Finally, the global gaze-relevant features disentangled by the upper face branch, combined with head pose and local eye features, are passed through the detection head for high-precision gaze estimation. Our proposed DMAGaze has been extensively validated on two mainstream public datasets, achieving state-of-the-art performance.
https://arxiv.org/abs/2504.11160
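As a rough illustration of the dual-branch idea in DMAGaze above, the following PyTorch sketch uses a continuous (sigmoid) mask to split encoded face features into gaze-relevant and gaze-irrelevant streams, each reconstructing only its own region; the layer shapes, loss, and toy eye-region mask are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchDisentangler(nn.Module):
    """Toy sketch: a continuous (sigmoid) mask splits encoded face features into
    gaze-relevant and gaze-irrelevant streams; each stream reconstructs only its
    own region of the face image, which drives the disentanglement."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU())
        self.mask_head = nn.Sequential(nn.Conv2d(feat_dim, 1, 1), nn.Sigmoid())
        self.decoder_eye = nn.Conv2d(feat_dim, 3, 3, padding=1)
        self.decoder_rest = nn.Conv2d(feat_dim, 3, 3, padding=1)

    def forward(self, face):
        feats = self.encoder(face)
        m = self.mask_head(feats)              # continuous mask in [0, 1]
        gaze_feats = feats * m                 # gaze-relevant stream
        other_feats = feats * (1.0 - m)        # gaze-irrelevant stream
        return gaze_feats, self.decoder_eye(gaze_feats), self.decoder_rest(other_feats), m

def disentangle_loss(recon_eye, recon_rest, face, eye_region):
    """Each branch is supervised only inside its own region of the face."""
    eye_loss = F.l1_loss(recon_eye * eye_region, face * eye_region)
    rest_loss = F.l1_loss(recon_rest * (1 - eye_region), face * (1 - eye_region))
    return eye_loss + rest_loss

model = DualBranchDisentangler()
face = torch.rand(2, 3, 64, 64)
eye_region = torch.zeros(2, 1, 64, 64)
eye_region[:, :, 16:32, 8:56] = 1.0            # toy eye-region mask
gaze_feats, r_eye, r_rest, m = model(face)
print(disentangle_loss(r_eye, r_rest, face, eye_region).item())
```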
Recent developments in hardware, computer graphics, and AI may soon enable AR/VR head-mounted displays (HMDs) to become everyday devices like smartphones and tablets. Eye trackers within HMDs offer a special opportunity in such setups, as they make gaze-based research and interaction possible. However, estimating users' gaze information often requires raw eye images and videos that contain iris textures, which are considered a gold-standard biometric for user authentication, and this raises privacy concerns. Previous research in the eye-tracking community has focused on obfuscating iris textures while keeping utility tasks such as gaze estimation accurate. Despite these attempts, there is no comprehensive benchmark that evaluates state-of-the-art approaches. Considering all of this, in this paper we benchmark blurring, noising, downsampling, the rubber sheet model, and iris style transfer for obfuscating user identity, and compare their impact on image quality, privacy, utility, and the risk of impostor attack on two datasets. We use eye segmentation and gaze estimation as utility tasks, reduction in iris recognition accuracy as a measure of privacy protection, and false acceptance rate to estimate the risk of attack. Our experiments show that canonical image-processing methods like blurring and noising have only a marginal impact on deep-learning-based tasks. While downsampling, the rubber sheet model, and iris style transfer are all effective in hiding user identifiers, iris style transfer, despite its higher computational cost, outperforms the others on both utility tasks and is more resilient against spoof attacks. Our analyses indicate that there is no universally optimal approach to balancing privacy, utility, and computational burden. Therefore, we recommend that practitioners consider the strengths and weaknesses of each approach, and possible combinations thereof, to reach an optimal privacy-utility trade-off.
https://arxiv.org/abs/2504.10267
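The classical obfuscation baselines benchmarked above (blurring, noising, downsampling) can be sketched in a few lines; the kernel size, noise level, and downsampling factor below are illustrative choices, and the rubber sheet model and iris style transfer are omitted because they require iris segmentation and a trained style network.

```python
import cv2
import numpy as np

def blur(eye_img, ksize=11, sigma=3.0):
    """Gaussian blur: removes high-frequency iris texture."""
    return cv2.GaussianBlur(eye_img, (ksize, ksize), sigma)

def add_noise(eye_img, std=15.0):
    """Additive Gaussian noise on pixel intensities."""
    noisy = eye_img.astype(np.float32) + np.random.normal(0.0, std, eye_img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def downsample(eye_img, factor=4):
    """Downsample then upsample back, discarding fine iris detail."""
    h, w = eye_img.shape[:2]
    small = cv2.resize(eye_img, (w // factor, h // factor), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)

eye = (np.random.rand(120, 160) * 255).astype(np.uint8)   # stand-in eye image
for name, fn in [("blur", blur), ("noise", add_noise), ("downsample", downsample)]:
    out = fn(eye)
    print(name, out.shape, out.dtype)
```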
This work aims to interpret human behavior in order to anticipate potential user confusion when a robot provides explanations for failure, allowing the robot to adapt its explanations for more natural and efficient collaboration. Using a dataset that included facial emotion detection, eye gaze estimation, and gestures from 55 participants in a user study, we analyzed how human behavior changed in response to different types of failures and varying explanation levels. Our goal is to assess whether human collaborators are ready to accept less detailed explanations without becoming confused. We formulate a data-driven predictor of human confusion during robot failure explanations. We also propose and evaluate a mechanism, based on this predictor, that adapts the explanation level according to observed human behavior. The promising results from this evaluation indicate the potential of this research for adapting a robot's explanations of failures to enhance the collaborative experience.
https://arxiv.org/abs/2504.09717
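A minimal sketch of what such a confusion predictor and adaptation mechanism might look like, assuming a hypothetical feature vector (emotion scores, gaze angles, a gesture flag), synthetic training data, and an assumed four-level explanation scale; the paper's actual features, model, and adaptation rule are not specified in the abstract.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-window behaviour features: three emotion scores, gaze yaw/pitch,
# and a gesture flag; the labels mark windows where the participant was confused.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

confusion_clf = LogisticRegression().fit(X, y)

def adapt_explanation_level(window_features, level, threshold=0.6):
    """If predicted confusion is high, give a more detailed explanation next time,
    otherwise try a less detailed one (assumed levels 0 = terse ... 3 = detailed)."""
    p_confused = confusion_clf.predict_proba(window_features.reshape(1, -1))[0, 1]
    return min(level + 1, 3) if p_confused > threshold else max(level - 1, 0)

print(adapt_explanation_level(X[0], level=1))
```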
Iris texture is widely regarded as a gold standard biometric modality for authentication and identification. The demand for robust iris recognition methods, coupled with growing security and privacy concerns regarding iris attacks, has escalated recently. Inspired by neural style transfer, an advanced technique that leverages neural networks to separate content and style features, we hypothesize that iris texture's style features provide a reliable foundation for recognition and are more resilient to variations like rotation and perspective shifts than traditional approaches. Our experimental results support this hypothesis, showing a significantly higher classification accuracy compared to conventional features. Further, we propose using neural style transfer to mask identifiable iris style features, ensuring the protection of sensitive biometric information while maintaining the utility of eye images for tasks like eye segmentation and gaze estimation. This work opens new avenues for iris-oriented, secure, and privacy-aware biometric systems.
https://arxiv.org/abs/2503.04707
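Style features in neural style transfer are commonly taken as Gram matrices of convolutional activations; the sketch below computes such a descriptor for an iris crop with a pretrained VGG-16 from torchvision. The layer indices and the downstream classifier are illustrative choices under that standard formulation, not the paper's exact setup.

```python
import torch
import torchvision.models as models

# Style features as channel-wise Gram matrices of VGG-16 activations, the
# standard notion of "style" in neural style transfer.
vgg = models.vgg16(weights="DEFAULT").features.eval()
style_layers = {3, 8, 15}                      # relu1_2, relu2_2, relu3_3

def gram(feat):
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

@torch.no_grad()
def iris_style_descriptor(img):
    """img: (B, 3, H, W) iris crop -> concatenated Gram-matrix style features."""
    grams, x = [], img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in style_layers:
            grams.append(gram(x).flatten(1))
    return torch.cat(grams, dim=1)

desc = iris_style_descriptor(torch.rand(2, 3, 224, 224))
print(desc.shape)   # a descriptor that could feed, e.g., an SVM or k-NN classifier
```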
Gaze estimation models are widely used in applications such as driver attention monitoring and human-computer interaction. While many methods for gaze estimation exist, they rely heavily on data-hungry deep learning to achieve high performance. This reliance often forces practitioners to harvest training data from unverified public datasets, outsource model training, or rely on pre-trained models. However, such practices expose gaze estimation models to backdoor attacks. In such attacks, adversaries inject backdoor triggers by poisoning the training data, creating a backdoor vulnerability: the model performs normally with benign inputs, but produces manipulated gaze directions when a specific trigger is present. This compromises the security of many gaze-based applications, such as causing the model to fail in tracking the driver's attention. To date, there is no defense that addresses backdoor attacks on gaze estimation models. In response, we introduce SecureGaze, the first solution designed to protect gaze estimation models from such attacks. Unlike classification models, defending gaze estimation poses unique challenges due to its continuous output space and globally activated backdoor behavior. By identifying distinctive characteristics of backdoored gaze estimation models, we develop a novel and effective approach to reverse-engineer the trigger function for reliable backdoor detection. Extensive evaluations in both digital and physical worlds demonstrate that SecureGaze effectively counters a range of backdoor attacks and outperforms seven state-of-the-art defenses adapted from classification models.
https://arxiv.org/abs/2502.20306
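One way to picture reverse-engineering a trigger for a regression model is to search for a small pattern that collapses the gaze output to a near-constant direction across inputs, exploiting the globally activated behavior described above; the optimization below is only an illustrative sketch on a toy model, not SecureGaze's algorithm.

```python
import torch
import torch.nn as nn

def reverse_engineer_trigger(gaze_model, images, steps=200, lr=0.05, lam=1e-3):
    """Toy sketch: search for a small additive pattern that makes the regressor's
    output nearly constant across different inputs, i.e. the globally activated
    behaviour of a backdoored gaze model. Not SecureGaze's actual algorithm."""
    mask = torch.zeros(1, 1, *images.shape[-2:], requires_grad=True)
    pattern = torch.zeros(1, *images.shape[1:], requires_grad=True)
    opt = torch.optim.Adam([mask, pattern], lr=lr)
    for _ in range(steps):
        m = torch.sigmoid(mask)
        triggered = images * (1 - m) + pattern * m
        out = gaze_model(triggered)                      # (B, 2) pitch/yaw
        # Low output variance indicates a candidate trigger; the mask penalty
        # keeps the recovered trigger small.
        loss = out.var(dim=0).sum() + lam * m.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask).detach(), pattern.detach(), loss.item()

toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
m, p, final_loss = reverse_engineer_trigger(toy_model, torch.rand(16, 3, 32, 32), steps=20)
print(m.shape, p.shape, final_loss)
```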
Accurate 3D gaze estimation in unconstrained real-world environments remains a significant challenge due to variations in appearance, head pose, occlusion, and the limited availability of in-the-wild 3D gaze datasets. To address these challenges, we introduce a novel Self-Training Weakly-Supervised Gaze Estimation framework (ST-WSGE). This two-stage learning framework leverages diverse 2D gaze datasets, such as gaze-following data, which offer rich variations in appearances, natural scenes, and gaze distributions, and proposes an approach to generate 3D pseudo-labels and enhance model generalization. Furthermore, traditional modality-specific models, designed separately for images or videos, limit the effective use of available training data. To overcome this, we propose the Gaze Transformer (GaT), a modality-agnostic architecture capable of simultaneously learning static and dynamic gaze information from both image and video datasets. By combining 3D video datasets with 2D gaze target labels from gaze following tasks, our approach achieves the following key contributions: (i) Significant state-of-the-art improvements in within-domain and cross-domain generalization on unconstrained benchmarks like Gaze360 and GFIE, with notable cross-modal gains in video gaze estimation; (ii) Superior cross-domain performance on datasets such as MPIIFaceGaze and Gaze360 compared to frontal face methods. Code and pre-trained models will be released to the community.
https://arxiv.org/abs/2502.20249
Complex application scenarios have raised critical requirements for precise and generalizable gaze estimation methods. Recently, the pre-trained CLIP has achieved remarkable performance on various vision tasks, but its potential has not been fully exploited in gaze estimation. In this paper, we propose a novel CLIP-driven Dual Feature Enhancing Network (CLIP-DFENet), which boosts gaze estimation performance with the help of CLIP under a novel `main-side' collaborative enhancing strategy. Accordingly, a Language-driven Differential Module (LDM) is designed on the basis of CLIP's text encoder to reveal the semantic difference of gaze. This module empowers our Core Feature Extractor to characterize gaze-related semantic information. Moreover, a Vision-driven Fusion Module (VFM) is introduced to strengthen the generalized and valuable components of the visual embeddings obtained via CLIP's image encoder, and uses them to further improve the generalization of the features captured by the Core Feature Extractor. Finally, a robust Double-head Gaze Regressor is adopted to map the enhanced features to gaze directions. Extensive experimental results on four challenging datasets, covering within-domain and cross-domain tasks, demonstrate the discriminability and generalizability of our CLIP-DFENet.
https://arxiv.org/abs/2502.20128
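A hedged sketch of how CLIP's text encoder could expose gaze semantics: gaze-describing prompts (hypothetical wording) are embedded and their pairwise differences taken, which is one plausible reading of the Language-driven Differential Module; it relies on the OpenAI `clip` package and downloads pretrained weights.

```python
import torch
import clip   # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical gaze-describing prompts; the actual prompts and strategy of the
# Language-driven Differential Module are not specified in the abstract.
prompts = ["a photo of a face looking to the left",
           "a photo of a face looking to the right",
           "a photo of a face looking straight ahead"]

with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Semantic differences of gaze as differences between prompt embeddings, which
# could then condition or supervise a visual gaze-feature extractor.
diff_left_right = text_emb[0] - text_emb[1]
print(text_emb.shape, diff_left_right.norm().item())
```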
Egocentric video gaze estimation requires models to capture individual gaze patterns while adapting to diverse user data. Our approach, FedCPF, leverages a transformer-based architecture and integrates it into a personalized federated learning (PFL) framework in which only the most significant parameters, those exhibiting the highest rate of change during training, are selected and frozen for personalization in client models. Through extensive experimentation on the EGTEA Gaze+ and Ego4D datasets, we demonstrate that FedCPF significantly outperforms previously reported federated learning methods, achieving superior recall, precision, and F1-score. These results confirm the effectiveness of our comprehensive parameter-freezing strategy in enhancing model personalization, making FedCPF a promising approach for tasks requiring both adaptability and accuracy in federated learning settings.
https://arxiv.org/abs/2502.18123
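The core personalization step, selecting and freezing the parameters with the highest rate of change, can be sketched as below; ranking is done per parameter tensor and the keep ratio is an assumed hyperparameter, so this illustrates the idea rather than FedCPF's exact procedure.

```python
import torch
import torch.nn as nn

def freeze_fastest_changing(model, prev_state, keep_ratio=0.1):
    """Rank parameter tensors by how much they changed since the previous round
    and freeze the top fraction on the client for personalization."""
    changes = {name: (p.detach() - prev_state[name]).abs().mean().item()
               for name, p in model.named_parameters()}
    cutoff = sorted(changes.values(), reverse=True)[max(1, int(len(changes) * keep_ratio)) - 1]
    for name, p in model.named_parameters():
        p.requires_grad = changes[name] < cutoff   # freeze the most-changed tensors
    return [n for n, c in changes.items() if c >= cutoff]

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
prev = {n: p.detach().clone() for n, p in model.named_parameters()}
with torch.no_grad():                              # simulate one round of local training
    for p in model.parameters():
        p.add_(torch.randn_like(p) * 0.01)
print("frozen:", freeze_fastest_changing(model, prev, keep_ratio=0.25))
```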
Mobile gaze tracking involves inferring a user's gaze point or direction on a mobile device's screen from facial images captured by the device's front camera. While this technology inspires an increasing number of gaze-interaction applications, achieving consistent accuracy remains challenging due to dynamic user-device spatial relationships and varied motion conditions inherent in mobile contexts. This paper provides empirical evidence on how user mobility and behaviour affect mobile gaze tracking accuracy. We conduct two user studies collecting behaviour and gaze data under various motion conditions - from lying to maze navigation - and during different interaction tasks. Quantitative analysis has revealed behavioural regularities among daily tasks and identified head distance, head pose, and device orientation as key factors affecting accuracy, with errors increasing by up to 48.91% in dynamic conditions compared to static ones. These findings highlight the need for more robust, adaptive eye-tracking systems that account for head movements and device deflection to maintain accuracy across diverse mobile contexts.
https://arxiv.org/abs/2502.10570
3D and 2D gaze estimation share the fundamental objective of capturing eye movements but are traditionally treated as two distinct research domains. In this paper, we introduce a novel cross-task few-shot 2D gaze estimation approach, aiming to adapt a pre-trained 3D gaze estimation network for 2D gaze prediction on unseen devices using only a few training images. This task is highly challenging due to the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data. To address these challenges, we propose a novel framework that bridges the gap between 3D and 2D gaze. Our framework contains a physics-based differentiable projection module with learnable parameters to model screen poses and project 3D gaze into 2D gaze. The framework is fully differentiable and can be integrated into existing 3D gaze networks without modifying their original architecture. Additionally, we introduce a dynamic pseudo-labelling strategy for flipped images; flipping is particularly challenging for 2D labels because the screen pose is unknown. To overcome this, we reverse the projection process by converting 2D labels to 3D space, where flipping is performed. Notably, this 3D space is not aligned with the camera coordinate system, so we learn a dynamic transformation matrix to compensate for this misalignment. We evaluate our method on the MPIIGaze, EVE, and GazeCapture datasets, collected respectively on laptops, desktop computers, and mobile devices. The superior performance highlights the effectiveness of our approach and demonstrates its strong potential for real-world applications.
https://arxiv.org/abs/2502.04074
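A physics-based differentiable projection of a 3D gaze ray onto a screen plane with learnable pose can be written directly as a ray-plane intersection; the axis-angle screen parameterization and initial values below are assumptions, but the module is fully differentiable, as described above.

```python
import torch
import torch.nn as nn

class DifferentiableScreenProjection(nn.Module):
    """Intersect a 3D gaze ray (origin + direction in camera coordinates) with a
    screen plane whose pose is learnable, and return 2D screen coordinates."""

    def __init__(self):
        super().__init__()
        self.rotvec = nn.Parameter(torch.zeros(3))                   # screen orientation (axis-angle)
        self.origin = nn.Parameter(torch.tensor([0.0, 0.0, 0.3]))    # screen origin in metres

    def rotation(self):
        # Rodrigues' formula: rotation matrix from an axis-angle vector.
        theta = self.rotvec.norm() + 1e-8
        k = self.rotvec / theta
        zero = torch.zeros((), dtype=k.dtype)
        K = torch.stack([torch.stack([zero, -k[2], k[1]]),
                         torch.stack([k[2], zero, -k[0]]),
                         torch.stack([-k[1], k[0], zero])])
        return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

    def forward(self, gaze_origin, gaze_dir):
        R = self.rotation()
        x_axis, y_axis, normal = R[:, 0], R[:, 1], R[:, 2]
        # Ray-plane intersection: find t so that gaze_origin + t * gaze_dir lies on the screen.
        t = ((self.origin - gaze_origin) @ normal) / (gaze_dir @ normal)
        hit = gaze_origin + t.unsqueeze(-1) * gaze_dir
        rel = hit - self.origin
        return torch.stack([rel @ x_axis, rel @ y_axis], dim=-1)     # 2D point of gaze

proj = DifferentiableScreenProjection()
origins = torch.zeros(4, 3)                                          # eye positions
dirs = torch.tensor([[0.0, 0.0, 1.0]]).repeat(4, 1)                  # unit gaze directions
print(proj(origins, dirs).shape)                                     # (4, 2), fully differentiable
```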
Despite decades of research on data collection and model architectures, current gaze estimation models face significant challenges in generalizing across diverse data domains. While recent advances in self-supervised pre-training have shown remarkable potential for improving model generalization in various vision tasks, their effectiveness in gaze estimation remains unexplored due to the geometric nature of the gaze regression task. We propose UniGaze, which leverages large-scale, in-the-wild facial datasets through self-supervised pre-training for gaze estimation. We carefully curate multiple facial datasets that capture diverse variations in identity, lighting, background, and head poses. By directly applying Masked Autoencoder (MAE) pre-training on normalized face images with a Vision Transformer (ViT) backbone, our UniGaze learns appropriate feature representations within the specific input space required by downstream gaze estimation models. Through comprehensive experiments using challenging cross-dataset evaluation and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. The source code and pre-trained models will be released upon acceptance.
https://arxiv.org/abs/2502.02307
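The MAE pre-training step on patchified, normalized face images boils down to random masking plus a reconstruction loss computed only on the masked patches; the sketch below uses a toy linear patch embedding and a single transformer layer in place of the ViT backbone UniGaze actually uses.

```python
import torch
import torch.nn as nn

def random_masking(x, mask_ratio=0.75):
    """MAE-style masking: keep a random subset of patch tokens per image."""
    B, N, D = x.shape
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = torch.rand(B, N).argsort(dim=1)
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask[:, :n_keep] = 0
    return kept, torch.gather(mask, 1, ids_restore), ids_restore     # mask: 1 = masked

# Toy stand-ins for UniGaze's ViT encoder/decoder on normalized face crops.
patch_dim, embed_dim = 16 * 16 * 3, 128
embed = nn.Linear(patch_dim, embed_dim)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(embed_dim, 4, batch_first=True), 1)
decoder_embed = nn.Linear(embed_dim, embed_dim)
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
decoder_pred = nn.Linear(embed_dim, patch_dim)

patches = torch.rand(4, 196, patch_dim)               # 14x14 grid of 16x16x3 patches
tokens = embed(patches)
kept, mask, ids_restore = random_masking(tokens)
latent = encoder(kept)

# Re-insert mask tokens at the masked positions, then predict pixels.
B, N, _ = tokens.shape
dec_in = torch.cat([decoder_embed(latent), mask_token.expand(B, N - latent.shape[1], -1)], dim=1)
dec_in = torch.gather(dec_in, 1, ids_restore.unsqueeze(-1).expand(-1, -1, embed_dim))
pred = decoder_pred(dec_in)

loss = (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()  # loss on masked patches only
print(loss.item())
```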
Current deep-learning-powered, appearance-based, uncertainty-aware gaze estimation models produce inconsistent and unreliable uncertainty estimates, which limits their adoption in downstream applications. In this study, we propose a workflow to improve the accuracy of uncertainty estimation using probability calibration with a few post hoc samples. The probability calibration process employs a simple secondary regression model to compensate for inaccuracies in the uncertainties estimated by the deep learning model. Training of the secondary model is detached from the main deep learning model, so no expensive weight tuning is required. The added calibration process is lightweight and relatively independent of the deep learning process, making it fast to run and easy to implement. We evaluated the effectiveness of the calibration process under four potential application scenarios with two datasets that have distinctive image characteristics due to their data collection setups. The calibration process is most effective when the calibration and testing data share similar characteristics. Even under suboptimal circumstances in which the calibration and testing data differ, the calibration process can still make corrections that reduce prediction errors in the uncertainty estimates made by uncalibrated models.
https://arxiv.org/abs/2501.14894
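A minimal sketch of the post hoc calibration idea, assuming the secondary regressor is an isotonic regression (the abstract only specifies a simple secondary regression model) fit on a handful of calibration samples pairing predicted uncertainty with observed error.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Calibration samples: the network's predicted uncertainty (e.g. std of the gaze
# angle, in degrees) paired with the observed absolute error of its predictions.
rng = np.random.default_rng(0)
pred_sigma = rng.uniform(0.5, 5.0, size=300)
observed_err = np.abs(rng.normal(0.0, 1.3 * pred_sigma + 0.4))   # deliberately miscalibrated

# Secondary post hoc regressor: maps raw uncertainty to a corrected value, trained
# completely apart from the deep model, so no network weights are touched.
calibrator = IsotonicRegression(out_of_bounds="clip").fit(pred_sigma, observed_err)

def calibrated_uncertainty(sigma):
    return calibrator.predict(np.atleast_1d(sigma))

print(calibrated_uncertainty(np.array([1.0, 2.5, 4.0])))
```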
Gaze estimation methods suffer significant performance deterioration when evaluated across different domains, because of the domain gap between the testing and training data. Existing methods try to solve this issue by reducing the deviation of the data distribution; however, they ignore label deviation in the data, which arises from the gaze-label acquisition mechanism and individual physiological differences. In this paper, we first point out that the influence of label deviation cannot be ignored, and propose a gaze label alignment algorithm (GLA) to eliminate the label distribution deviation. Specifically, we first train the feature extractor on all domains to obtain domain-invariant features, and then select an anchor domain to train the gaze regressor. We predict gaze labels on the remaining domains and use a mapping function to align the labels. Finally, these aligned labels can be used to train gaze estimation models. Therefore, our method can be combined with any existing method. Experimental results show that our GLA method can effectively alleviate the label distribution shift, and that state-of-the-art gaze estimation methods can be further improved markedly.
https://arxiv.org/abs/2412.15601
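The label alignment step can be illustrated with a least-squares affine map from a domain's raw labels to the anchor regressor's predictions on that domain; the affine form and the synthetic numbers are assumptions, since the abstract does not specify the mapping function.

```python
import numpy as np

def fit_label_alignment(pred_gaze, raw_labels):
    """Fit an affine map from a domain's raw gaze labels (pitch/yaw) to the labels
    predicted by the anchor-domain regressor; least squares is one simple choice."""
    X = np.hstack([raw_labels, np.ones((len(raw_labels), 1))])    # (N, 3)
    W, *_ = np.linalg.lstsq(X, pred_gaze, rcond=None)             # (3, 2)
    return W

def align_labels(raw_labels, W):
    return np.hstack([raw_labels, np.ones((len(raw_labels), 1))]) @ W

rng = np.random.default_rng(0)
raw = rng.uniform(-0.5, 0.5, size=(100, 2))            # one domain's original labels
pred = raw * 1.05 + np.array([0.02, -0.03])            # anchor regressor's predictions
W = fit_label_alignment(pred, raw)
print(np.abs(align_labels(raw, W) - pred).max())       # aligned labels track the anchor domain
```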
Uncertainty in gaze estimation manifests in two aspects: 1) low-quality images caused by occlusion, blurriness, inconsistent eye movements, or even non-face images; 2) incorrect labels resulting from the misalignment between the labeled and actual gaze points during the annotation process. Allowing these uncertainties to participate in training hinders the improvement of gaze estimation. To tackle these challenges, in this paper, we propose an effective solution, named Suppressing Uncertainty in Gaze Estimation (SUGE), which introduces a novel triplet-label consistency measurement to estimate and reduce the uncertainties. Specifically, for each training sample, we propose to estimate a novel ``neighboring label'' calculated by a linearly weighted projection from the neighbors to capture the similarity relationship between image features and their corresponding labels, which can be incorporated with the predicted pseudo label and ground-truth label for uncertainty estimation. By modeling such triplet-label consistency, we can measure the qualities of both images and labels, and further largely reduce the negative effects of unqualified images and wrong labels through our designed sample weighting and label correction strategies. Experimental results on the gaze estimation benchmarks indicate that our proposed SUGE achieves state-of-the-art performance.
https://arxiv.org/abs/2412.12890
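A sketch of the neighboring-label computation: each sample's label is compared against a weighted combination of its feature-space neighbors' labels, and disagreement among the three labels (neighboring, pseudo, ground truth) flags unreliable samples; the exponential similarity weights are an illustrative stand-in for the paper's linearly weighted projection.

```python
import numpy as np

def neighboring_label(features, labels, k=5):
    """For each sample, combine the labels of its k nearest feature-space neighbors;
    the exponential similarity weights stand in for the paper's linear projection."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                       # exclude the sample itself
    idx = np.argsort(d2, axis=1)[:, :k]
    w = np.exp(-np.take_along_axis(d2, idx, axis=1))
    w = w / w.sum(axis=1, keepdims=True)
    return (w[..., None] * labels[idx]).sum(axis=1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))                      # image features
labels = rng.uniform(-0.6, 0.6, size=(50, 2))          # ground-truth pitch/yaw
pseudo = labels + rng.normal(scale=0.05, size=labels.shape)   # stand-in predicted labels
nb = neighboring_label(feats, labels)

# Triplet-label consistency: large disagreement between neighboring, pseudo, and
# ground-truth labels flags low-quality images or wrong labels for down-weighting.
inconsistency = np.linalg.norm(nb - labels, axis=1) + np.linalg.norm(pseudo - labels, axis=1)
print(inconsistency[:5])
```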
In recent years, the accuracy of gaze estimation techniques has gradually improved, but existing methods often rely on large datasets or large models to improve performance, which leads to high demands on computational resources. To address this issue, this paper proposes EM-Net, a lightweight gaze estimation model based on deep learning and the traditional Expectation-Maximization (EM) algorithm. First, the proposed Global Attention Mechanism (GAM) is added to extract features related to gaze estimation, improving the model's ability to capture global dependencies and thus its performance. Second, by learning hierarchical feature representations through the EM module, the model gains strong generalization ability, which reduces the required sample size. Experiments confirm that, using only 50% of the training data, EM-Net improves performance on the Gaze360, MPIIFaceGaze, and RT-Gene datasets by 2.2%, 2.02%, and 2.03%, respectively, compared with GazeNAS-ETH. It also shows good robustness to Gaussian noise interference.
https://arxiv.org/abs/2412.08074
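The abstract does not detail the EM module, so the sketch below shows a generic EM-style feature routine in the spirit of EM attention: soft-assign spatial features to a few bases (E-step), re-estimate the bases (M-step), and rebuild a compact feature map. Treat it as an illustration of how EM iterations can yield compact, hierarchical representations, not EM-Net's design.

```python
import torch

def em_feature_bases(features, n_bases=8, iters=3):
    """Generic EM-style routine: soft-assign spatial features to a few bases
    (E-step), re-estimate the bases (M-step), then rebuild a compact feature map."""
    B, C, H, W = features.shape
    x = features.flatten(2).transpose(1, 2)                        # (B, HW, C)
    mu = torch.randn(B, n_bases, C, device=features.device)
    mu = mu / mu.norm(dim=-1, keepdim=True)
    for _ in range(iters):
        resp = torch.softmax(x @ mu.transpose(1, 2), dim=-1)       # E-step: (B, HW, K)
        mu = resp.transpose(1, 2) @ x                              # M-step: (B, K, C)
        mu = mu / (resp.sum(dim=1).unsqueeze(-1) + 1e-6)
    recon = (resp @ mu).transpose(1, 2).reshape(B, C, H, W)
    return recon, mu

recon, bases = em_feature_bases(torch.rand(2, 32, 14, 14))
print(recon.shape, bases.shape)          # (2, 32, 14, 14), (2, 8, 32)
```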
Using lightweight models as backbone networks in gaze estimation tasks often results in significant performance degradation. The main reason is that the number of feature channels in lightweight networks is usually small, which limits the model's expressive capacity. To improve the performance of lightweight models in gaze estimation tasks, we propose a network model named Multitask-Gaze. The main components of Multitask-Gaze include Unidirectional Convolution (UC), Spatial and Channel Attention (SCA), a Global Convolution Module (GCM), and a Multi-task Regression Module (MRM). UC not only significantly reduces the number of parameters and FLOPs, but also extends the receptive field and improves the long-distance modeling capability of the model, thereby improving performance. SCA highlights gaze-related features and suppresses gaze-irrelevant ones. GCM replaces the pooling layer and avoids the performance degradation caused by information loss. MRM improves the accuracy of individual tasks and strengthens the connections between tasks for overall performance improvement. Experimental results show that, compared with the state-of-the-art method SUGE, Multitask-Gaze improves performance on the MPIIFaceGaze and Gaze360 datasets by 1.71% and 2.75%, respectively, while reducing the number of parameters and FLOPs by 75.5% and 86.88%.
https://arxiv.org/abs/2411.18061
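Reading Unidirectional Convolution as a pair of 1xk and kx1 strip convolutions (an assumption; the paper's exact design may differ) makes the parameter saving and the enlarged receptive field easy to see:

```python
import torch
import torch.nn as nn

class UnidirectionalConv(nn.Module):
    """Pair of 1xk and kx1 strip convolutions: far fewer parameters than a full
    kxk kernel, with a wide receptive field along each axis."""

    def __init__(self, channels, k=7):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))

    def forward(self, x):
        return self.vertical(self.horizontal(x))

uc, full = UnidirectionalConv(32), nn.Conv2d(32, 32, 7, padding=3)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(uc), "parameters vs", count(full), "for a 7x7 convolution")
print(uc(torch.rand(1, 32, 28, 28)).shape)
```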
Gaze estimation encounters generalization challenges when dealing with out-of-distribution data. To address this problem, recent methods use neural radiance fields (NeRF) to generate augmented data. However, existing methods based on NeRF are computationally expensive and lack facial details. 3D Gaussian Splatting (3DGS) has become the prevailing representation of neural fields. While 3DGS has been extensively examined in head avatars, it faces challenges with accurate gaze control and generalization across different subjects. In this work, we propose GazeGaussian, a high-fidelity gaze redirection method that uses a two-stream 3DGS model to represent the face and eye regions separately. By leveraging the unstructured nature of 3DGS, we develop a novel eye representation for rigid eye rotation based on the target gaze direction. To enhance synthesis generalization across various subjects, we integrate an expression-conditional module to guide the neural renderer. Comprehensive experiments show that GazeGaussian outperforms existing methods in rendering speed, gaze redirection accuracy, and facial synthesis across multiple datasets. We also demonstrate that existing gaze estimation methods can leverage GazeGaussian to improve their generalization performance. The code will be available at: this https URL.
https://arxiv.org/abs/2411.12981
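The rigid eye rotation driven by a target gaze direction can be pictured as rotating the eye-region Gaussian centers about an eyeball center; the pitch/yaw convention and toy numbers below are assumptions, and all splatting details are omitted.

```python
import math
import torch

def gaze_rotation(pitch, yaw):
    """Rotation matrix mapping the canonical forward axis to the target gaze
    direction (pitch/yaw in radians; camera-style coordinates assumed)."""
    cp, sp, cy, sy = math.cos(pitch), math.sin(pitch), math.cos(yaw), math.sin(yaw)
    Rx = torch.tensor([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    Ry = torch.tensor([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    return Ry @ Rx

def rotate_eye_gaussians(means, eyeball_center, pitch, yaw):
    """Rigidly rotate eye-region Gaussian centers about the eyeball center so the
    rendered eye looks toward the target gaze (splatting details omitted)."""
    R = gaze_rotation(pitch, yaw)
    return (means - eyeball_center) @ R.T + eyeball_center

means = torch.rand(500, 3) * 0.012          # toy eye-region Gaussian centers, in metres
center = means.mean(dim=0)
print(rotate_eye_gaussians(means, center, pitch=0.1, yaw=-0.2).shape)
```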
The ability of gaze estimation models to generalize is often significantly hindered by various factors unrelated to gaze, especially when the training dataset is limited. Current strategies aim to address this challenge through different domain generalization techniques, yet they have had limited success due to the risk of overfitting when relying solely on value labels for regression. Recent progress in pre-trained vision-language models has motivated us to capitalize on the abundant semantic information available. We propose a novel approach in this paper, reframing the gaze estimation task as a vision-language alignment problem. Our proposed framework, named Language-Guided Gaze Estimation (LG-Gaze), learns continuous and geometry-sensitive features for gaze estimation, benefiting from the rich prior knowledge of vision-language models. Specifically, LG-Gaze aligns gaze features with continuous linguistic features through our proposed multimodal contrastive regression loss, which customizes adaptive weights for different negative samples. Furthermore, to better adapt to the labels of the gaze estimation task, we propose a geometry-aware interpolation method to obtain more precise gaze embeddings. Through extensive experiments, we validate the efficacy of our framework on four different cross-domain evaluation tasks.
https://arxiv.org/abs/2411.08606
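The abstract does not give the exact form of the multimodal contrastive regression loss, so the sketch below shows one plausible instantiation: an InfoNCE-style loss between image and language features in which each negative pair is re-weighted by the angular distance between the corresponding gaze labels (the adaptive-weights idea).

```python
import torch
import torch.nn.functional as F

def angular_distance(g1, g2):
    """Angle (radians) between 3D gaze vectors built from pitch/yaw labels."""
    def to_vec(py):
        pitch, yaw = py[..., 0], py[..., 1]
        return torch.stack([torch.cos(pitch) * torch.sin(yaw),
                            torch.sin(pitch),
                            torch.cos(pitch) * torch.cos(yaw)], dim=-1)
    cos = (to_vec(g1) * to_vec(g2)).sum(-1).clamp(-1 + 1e-6, 1 - 1e-6)
    return torch.acos(cos)

def contrastive_regression_loss(img_feat, text_feat, gaze_labels, tau=0.1):
    """The i-th image feature should match the i-th language feature; other pairs act
    as negatives whose weight grows with the angular distance between their labels."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(text_feat, dim=-1)
    logits = img @ txt.T / tau                                            # (B, B)
    ang = angular_distance(gaze_labels[:, None, :], gaze_labels[None, :, :])
    weights = ang / (ang.max() + 1e-6)
    weights.fill_diagonal_(1.0)
    log_prob = torch.log_softmax(logits + torch.log(weights + 1e-6), dim=-1)
    return -log_prob.diag().mean()

B, D = 8, 64
loss = contrastive_regression_loss(torch.randn(B, D), torch.randn(B, D), torch.rand(B, 2) - 0.5)
print(loss.item())
```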
We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled image adding/deleting, repositioning, and surface material changes of image objects, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters, performing accurate real-time gaze predictions tailored to individual users' eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze. This real-time gaze estimation enables various visual content generation tasks, all controlled by the user's gaze. The input for DFT Gaze is the user's eye images, while the inputs for visual content generation are the user's view and the predicted gaze point from DFT Gaze. To achieve efficient gaze predictions, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder, developing a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on the edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.
https://arxiv.org/abs/2411.04335
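Knowledge distillation from the 10x-larger teacher to the compact student can be sketched as matching both the teacher's gaze output and an intermediate feature through a small projection layer; the toy MLPs, layer choices, and loss weights below are assumptions, and the masked-autoencoder integration and Adapter fine-tuning are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_step(student, teacher, proj, eye_images, alpha=0.5):
    """One distillation step: the small student imitates the large teacher's gaze
    output and, via a projection layer, its penultimate features."""
    with torch.no_grad():
        t_feat = teacher[:-1](eye_images)
        t_gaze = teacher[-1](t_feat)
    s_feat = student[:-1](eye_images)
    s_gaze = student[-1](s_feat)
    loss_out = F.l1_loss(s_gaze, t_gaze)             # match predicted pitch/yaw
    loss_feat = F.mse_loss(proj(s_feat), t_feat)     # match intermediate features
    return loss_out + alpha * loss_feat

teacher = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(), nn.Linear(256, 2))
student = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32), nn.ReLU(), nn.Linear(32, 2))
proj = nn.Linear(32, 256)                            # aligns student and teacher feature sizes
eyes = torch.rand(8, 1, 64, 64)                      # grayscale eye crops
print(distillation_step(student, teacher, proj, eyes).item())
```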
Gaze is a crucial social cue in any interaction scenario and drives many mechanisms of social cognition (joint and shared attention, prediction of human intention, coordination tasks). Gaze direction is an indicator of social and emotional functioning and affects the way emotions are perceived. Evidence shows that embodied humanoid robots endowed with social abilities can serve as sophisticated stimuli for unraveling many mechanisms of human social cognition, while increasing engagement and ecological validity. In this context, building a robotic perception system that automatically estimates human gaze while relying only on the robot's sensors is still demanding. The main goal of this paper is to propose a learning-based robotic architecture that estimates human gaze direction in table-top scenarios without any external hardware. Table-top tasks are widely used in experimental psychology because they are suitable for implementing numerous scenarios in which agents collaborate while maintaining face-to-face interaction. Such an architecture can provide valuable support in studies where external hardware might be an obstacle to spontaneous human behaviour, especially in environments less controlled than the laboratory (e.g., clinical settings). A novel dataset was also collected with the humanoid robot iCub, including annotated images from 24 participants under different gaze conditions.
https://arxiv.org/abs/2410.19374