This report presents our solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2025. Egocentric video captures the scene from the wearer's perspective, where gaze serves as a key non-verbal cue that reflects visual attention and offers insight into human intention and cognition. Motivated by this, we propose a novel approach, GazeNLQ, which leverages gaze to retrieve video segments that match given natural language queries. Specifically, we introduce a contrastive learning-based pretraining strategy for estimating gaze directly from video. The estimated gaze is used to augment video representations within the proposed model, thereby enhancing localization accuracy. Experimental results show that GazeNLQ achieves R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively. Our code is available at this https URL.
https://arxiv.org/abs/2506.05782
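As a rough illustration of the contrastive pretraining idea behind GazeNLQ, the sketch below aligns paired video and gaze embeddings with a symmetric InfoNCE loss. The encoder outputs, embedding dimension, and temperature are placeholders, since the paper does not publish these details here.

```python
# A minimal sketch of a contrastive pretraining objective, assuming the goal is
# to align clip-level video embeddings with embeddings of the corresponding gaze
# signal (names and shapes here are illustrative, not GazeNLQ's actual setup).
import torch
import torch.nn.functional as F

def info_nce(video_emb: torch.Tensor, gaze_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (video, gaze) embeddings."""
    v = F.normalize(video_emb, dim=-1)          # (B, D)
    g = F.normalize(gaze_emb, dim=-1)           # (B, D)
    logits = v @ g.t() / temperature            # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Matching pairs sit on the diagonal; treat both directions as classification.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example usage with random features standing in for encoder outputs.
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```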
In recent years, there has been increasing interest in automatic facial behavior analysis systems from computing communities such as vision, multimodal interaction, robotics, and affective computing. Building upon the widespread utility of prior open-source facial analysis systems, we introduce OpenFace 3.0, an open-source toolkit capable of facial landmark detection, facial action unit detection, eye-gaze estimation, and facial emotion recognition. OpenFace 3.0 contributes a lightweight unified model for facial analysis, trained with a multi-task architecture across diverse populations, head poses, lighting conditions, video resolutions, and facial analysis tasks. By leveraging the benefits of parameter sharing through a unified model and training paradigm, OpenFace 3.0 exhibits improvements in prediction performance, inference speed, and memory efficiency over similar toolkits and rivals state-of-the-art models. OpenFace 3.0 can be installed and run with a single line of code and operates in real time without specialized hardware. The OpenFace 3.0 code for training models and running the system is freely available for research purposes and supports contributions from the community.
https://arxiv.org/abs/2506.02891
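The unified multi-task design can be pictured as a single shared backbone feeding lightweight task heads. The sketch below uses a ResNet-18 backbone and head sizes chosen purely for illustration; OpenFace 3.0's actual architecture, task definitions, and output formats may differ.

```python
import torch
import torch.nn as nn
import torchvision

class MultiTaskFaceModel(nn.Module):
    """Shared backbone with one lightweight head per facial-analysis task."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()              # expose 512-d pooled features
        self.backbone = backbone
        self.heads = nn.ModuleDict({
            "landmarks": nn.Linear(feat_dim, 68 * 2),   # 68 (x, y) points
            "action_units": nn.Linear(feat_dim, 12),    # AU presence logits
            "gaze": nn.Linear(feat_dim, 3),             # 3D gaze vector
            "emotion": nn.Linear(feat_dim, 8),          # emotion-class logits
        })

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)
        return {task: head(feats) for task, head in self.heads.items()}

outputs = MultiTaskFaceModel()(torch.randn(2, 3, 224, 224))
```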
Mobile gaze tracking faces a fundamental challenge: maintaining accuracy as users naturally change their postures and device orientations. Traditional one-off calibration approaches fail to adapt to these dynamic conditions, leading to degraded performance over time. We present MAC-Gaze, a Motion-Aware Continual Calibration approach that leverages smartphone inertial measurement unit (IMU) sensors and continual learning techniques to automatically detect changes in user motion states and update the gaze tracking model accordingly. Our system integrates a pre-trained visual gaze estimator and an IMU-based activity recognition model with a clustering-based hybrid decision-making mechanism that triggers recalibration when motion patterns deviate significantly from previously encountered states. To enable cumulative learning of new motion conditions while mitigating catastrophic forgetting, we employ replay-based continual learning, allowing the model to maintain performance across previously encountered motion conditions. We evaluate our system through extensive experiments on the publicly available RGBDGaze dataset and our own 10-hour multimodal MotionGaze dataset (481K+ images, 800K+ IMU readings), encompassing a wide range of postures under various motion conditions including sitting, standing, lying, and walking. Results demonstrate that our method reduces gaze estimation error by 19.9% on RGBDGaze (from 1.73 cm to 1.41 cm) and by 31.7% on MotionGaze (from 2.81 cm to 1.92 cm) compared to traditional calibration approaches. Our framework provides a robust solution for maintaining gaze estimation accuracy in mobile scenarios.
https://arxiv.org/abs/2505.22769
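A minimal sketch of the trigger-plus-replay logic described above: recalibration fires when the current IMU feature vector is far from every known motion-state centroid, and fine-tuning mixes new samples with replayed ones. The thresholding, centroid bookkeeping, and `finetune_fn` are simplified stand-ins for MAC-Gaze's clustering-based hybrid decision mechanism, not its actual implementation.

```python
import random
import numpy as np

class MotionAwareCalibrator:
    """Sketch: recalibrate when the IMU feature vector is far from every
    previously seen motion-state centroid, and fine-tune on a mix of new
    and replayed samples to mitigate catastrophic forgetting."""
    def __init__(self, distance_threshold: float = 2.0, buffer_size: int = 2000):
        self.centroids = []       # one centroid per known motion state
        self.replay_buffer = []   # (features, gaze_label) pairs from past states
        self.distance_threshold = distance_threshold
        self.buffer_size = buffer_size

    def needs_recalibration(self, imu_features: np.ndarray) -> bool:
        if not self.centroids:
            return True
        dists = [np.linalg.norm(imu_features - c) for c in self.centroids]
        return min(dists) > self.distance_threshold

    def recalibrate(self, new_samples, imu_features: np.ndarray, finetune_fn):
        self.centroids.append(imu_features.copy())
        k = min(len(self.replay_buffer), len(new_samples))
        replayed = random.sample(self.replay_buffer, k) if k else []
        finetune_fn(new_samples + replayed)          # replay old states during fine-tuning
        self.replay_buffer.extend(new_samples)
        self.replay_buffer = self.replay_buffer[-self.buffer_size:]
```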
Eye gaze can provide rich information on human psychological activities and has garnered significant attention in the field of Human-Robot Interaction (HRI). However, existing gaze estimation methods merely predict either the gaze direction or the Point-of-Gaze (PoG) on the screen, failing to provide sufficient information for a comprehensive six Degree-of-Freedom (DoF) gaze analysis in 3D space. Moreover, variations in eye shape and structure among individuals also impede the generalization capability of these methods. In this study, we propose MAGE, a Multi-task Architecture for Gaze Estimation with an efficient calibration module, to predict 6-DoF gaze information applicable to real-world HRI. Our basic model encodes both directional and positional features from facial images and predicts gaze results with a dedicated information flow and multiple decoders. To reduce the impact of individual variations, we propose a novel calibration module, Easy-Calibration, which fine-tunes the basic model with subject-specific data and is efficient to implement without the need for a screen. Experimental results demonstrate that our method achieves state-of-the-art performance on the public MPIIFaceGaze and EYEDIAP datasets and on our own IMRGaze dataset.
https://arxiv.org/abs/2505.16384
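To make the calibration step concrete, here is a hedged sketch of subject-specific fine-tuning in the spirit of Easy-Calibration: the pre-trained base model stays frozen and only a small calibration head is updated on a handful of (image, 6-DoF gaze) pairs. The model interfaces, loss, and hyperparameters are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def easy_calibrate(base_model: nn.Module, calib_head: nn.Module,
                   samples, epochs: int = 20, lr: float = 1e-3):
    """Fine-tune only a small calibration head on subject-specific pairs of
    (face image, 6-DoF gaze label), keeping the pre-trained base model frozen."""
    for p in base_model.parameters():
        p.requires_grad = False
    base_model.eval()
    opt = torch.optim.Adam(calib_head.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for image, gaze_6dof in samples:
            with torch.no_grad():
                feats = base_model(image.unsqueeze(0))   # frozen feature extractor
            pred = calib_head(feats)
            loss = loss_fn(pred, gaze_6dof.unsqueeze(0))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return calib_head
```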
Cross-domain Gaze Estimation (CDGE) aims to generalize a well-trained gaze estimation model to new target domains for real-world application scenarios. Existing CDGE methods typically extract domain-invariant features to mitigate domain shift in feature space, an approach that Generalized Label Shift (GLS) theory shows to be insufficient. In this paper, we introduce a GLS perspective on CDGE and model the cross-domain problem as a combination of label shift and conditional shift. We present a GLS correction framework together with a feasible realization, in which an importance reweighting strategy based on a truncated Gaussian distribution overcomes the continuity challenge in label shift correction. To embed the reweighted source distribution into conditional invariant learning, we further derive a probability-aware estimate of the conditional operator discrepancy. Extensive experiments on standard CDGE tasks with different backbone models validate the superior cross-domain generalization of the proposed method and its applicability to various models.
https://arxiv.org/abs/2505.13043
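The label-shift correction can be illustrated with importance weights w(y) = p_T(y)/p_S(y), where both label densities over the continuous gaze-label range are modelled as truncated Gaussians. The sketch below uses SciPy's truncnorm; the angular bounds and distribution parameters are illustrative and not taken from the paper.

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_gaussian(lower, upper, mean, std):
    a, b = (lower - mean) / std, (upper - mean) / std   # bounds in standard units
    return truncnorm(a, b, loc=mean, scale=std)

def importance_weights(source_labels, src_params, tgt_params, bounds=(-90.0, 90.0)):
    """w(y) = p_T(y) / p_S(y) for continuous gaze labels (e.g., yaw in degrees),
    with both densities modelled as truncated Gaussians over the label range."""
    p_src = truncated_gaussian(*bounds, *src_params).pdf(source_labels)
    p_tgt = truncated_gaussian(*bounds, *tgt_params).pdf(source_labels)
    return p_tgt / np.clip(p_src, 1e-8, None)

# Source yaw labels centred at 0 degrees, target distribution shifted towards +20 degrees.
weights = importance_weights(np.random.uniform(-60, 60, 1000),
                             src_params=(0.0, 30.0), tgt_params=(20.0, 25.0))
```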
We propose a novel 3D gaze estimation approach that learns spatial relationships between the subject and objects in the scene and outputs a 3D gaze direction. Our method targets unconstrained settings, including cases where close-up views of the subject's eyes are unavailable, such as when the subject is distant or facing away. Previous approaches typically rely on 2D appearance alone or incorporate limited spatial cues from depth maps in a non-learnable post-processing step. Estimating 3D gaze direction from 2D observations in these scenarios is challenging; variations in subject pose, scene layout, and gaze direction, combined with differing camera poses, yield diverse 2D appearances and 3D gaze directions even when targeting the same 3D scene. To address this issue, we propose GA3CE: Gaze-Aware 3D Context Encoding. Our method represents the subject and scene using 3D poses and object positions, treating them as 3D context to learn spatial relationships in 3D space. Inspired by human vision, we align this context in an egocentric space, significantly reducing spatial complexity. Furthermore, we propose D$^3$ (direction-distance-decomposed) positional encoding to better capture the spatial relationship between the 3D context and gaze direction in direction and distance space. Experiments demonstrate substantial improvements, reducing mean angular error by 13%-37% compared to leading baselines on benchmark datasets in single-frame settings.
https://arxiv.org/abs/2505.10671
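A simplified sketch of the direction-distance decomposition: an object's egocentric offset is split into a unit direction and a scalar distance, and each part is frequency-encoded separately before concatenation. The sinusoidal scheme and encoding dimensions are assumptions; GA3CE's exact formulation may differ.

```python
import torch

def sinusoidal(x: torch.Tensor, num_freqs: int = 4) -> torch.Tensor:
    """NeRF-style frequency encoding applied elementwise."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)
    angles = x.unsqueeze(-1) * freqs            # (..., D, F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

def d3_encoding(subject_pos: torch.Tensor, object_pos: torch.Tensor) -> torch.Tensor:
    """Direction-distance-decomposed encoding of an object relative to the subject:
    the offset is split into a unit direction and a scalar distance, and each part
    is frequency-encoded separately before concatenation."""
    offset = object_pos - subject_pos           # (..., 3), egocentric offset
    dist = offset.norm(dim=-1, keepdim=True)    # (..., 1)
    direction = offset / dist.clamp_min(1e-6)   # (..., 3) unit direction
    return torch.cat([sinusoidal(direction), sinusoidal(dist)], dim=-1)

enc = d3_encoding(torch.zeros(1, 3), torch.tensor([[0.5, 0.2, 2.0]]))
```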
Unconstrained gaze estimation is the process of determining where a subject is directing their visual attention in uncontrolled environments. Gaze estimation systems are important for tasks such as driver distraction monitoring, exam proctoring, and accessibility features in modern software. However, these systems face challenges in real-world scenarios, partially due to the low resolution of in-the-wild images and partially due to insufficient modeling of head-eye interactions in current state-of-the-art (SOTA) methods. This paper introduces DHECA-SuperGaze, a deep learning-based method that advances gaze prediction through super-resolution (SR) and a dual head-eye cross-attention (DHECA) module. Our dual-branch convolutional backbone processes eye and multiscale SR head images, while the proposed DHECA module enables bidirectional feature refinement between the extracted visual features through cross-attention mechanisms. Furthermore, we identified critical annotation errors in Gaze360, one of the most diverse and widely used gaze estimation datasets, and rectified the mislabeled data. Performance evaluation on the Gaze360 and GFIE datasets demonstrates superior within-dataset performance of the proposed method, reducing angular error (AE) by 0.48° (Gaze360) and 2.95° (GFIE) in static configurations, and by 0.59° (Gaze360) and 3.00° (GFIE) in temporal settings compared to prior SOTA methods. Cross-dataset testing shows AE improvements of more than 1.53° (Gaze360) and 3.99° (GFIE) in both static and temporal settings, validating the robust generalization properties of our approach.
https://arxiv.org/abs/2505.08426
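The bidirectional refinement can be sketched with two standard multi-head cross-attention blocks: eye tokens attend to head tokens and head tokens attend to eye tokens, with residual connections and layer norms. Token counts and dimensions below are illustrative, not the DHECA module's actual configuration.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Bidirectional cross-attention: eye tokens attend to head tokens and
    head tokens attend to eye tokens, refining both feature sets."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.eye_from_head = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head_from_eye = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_eye = nn.LayerNorm(dim)
        self.norm_head = nn.LayerNorm(dim)

    def forward(self, eye_tokens: torch.Tensor, head_tokens: torch.Tensor):
        eye_ref, _ = self.eye_from_head(eye_tokens, head_tokens, head_tokens)
        head_ref, _ = self.head_from_eye(head_tokens, eye_tokens, eye_tokens)
        return self.norm_eye(eye_tokens + eye_ref), self.norm_head(head_tokens + head_ref)

eye, head = DualCrossAttention()(torch.randn(2, 16, 256), torch.randn(2, 49, 256))
```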
This paper presents a robust, occlusion-aware driver monitoring system (DMS) utilizing the Driver Monitoring Dataset (DMD). The system performs driver identification, gaze estimation by regions, and face occlusion detection under varying lighting conditions, including challenging low-light scenarios. Aligned with EuroNCAP recommendations, the inclusion of occlusion detection enhances situational awareness and system trustworthiness by indicating when the system's performance may be degraded. The system employs separate algorithms trained on RGB and infrared (IR) images to ensure reliable functioning. We detail the development and integration of these algorithms into a cohesive pipeline, addressing the challenges of working with different sensors and real-car implementation. Evaluation on the DMD and in real-world scenarios demonstrates the effectiveness of the proposed system, highlighting the superior performance of RGB-based models and the pioneering contribution of robust occlusion detection in DMS.
https://arxiv.org/abs/2504.20677
We present HOIGaze, a novel learning-based approach for gaze estimation during hand-object interactions (HOI) in extended reality (XR). HOIGaze addresses the challenging HOI setting by building on one key insight: eye, hand, and head movements are closely coordinated during HOIs, and this coordination can be exploited to identify the samples that are most useful for training a gaze estimator, effectively denoising the training data. This denoising approach is in stark contrast to previous gaze estimation methods that treated all training samples as equal. Specifically, we propose: 1) a novel hierarchical framework that first recognises the hand currently being visually attended to and then estimates gaze direction based on the attended hand; 2) a new gaze estimator that uses cross-modal Transformers to fuse head and hand-object features extracted with a convolutional neural network and a spatio-temporal graph convolutional network; and 3) a novel eye-head coordination loss that upweights training samples exhibiting coordinated eye-head movements. We evaluate HOIGaze on the HOT3D and Aria Digital Twin (ADT) datasets and show that it significantly outperforms state-of-the-art methods, achieving an average improvement in mean angular error of 15.6% on HOT3D and 6.0% on ADT. To demonstrate the potential of our method, we further report significant performance improvements on the example downstream task of eye-based activity recognition on ADT. Taken together, our results underline the significant information content available in eye-hand-head coordination and, as such, open up an exciting new direction for learning-based gaze estimation.
https://arxiv.org/abs/2504.19828
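One way to picture the coordination-based weighting is a per-sample angular loss scaled by an eye-head agreement score. In the sketch below that score is simply the cosine similarity between the ground-truth gaze and the head direction; HOIGaze's actual eye-head coordination loss is defined differently, so treat this only as the general shape of the idea.

```python
import torch

def angular_error(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-sample angle (radians) between predicted and ground-truth gaze vectors."""
    cos = torch.nn.functional.cosine_similarity(pred, target, dim=-1)
    return torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))

def coordination_weighted_loss(pred, target, head_dir, alpha: float = 1.0):
    """Upweight samples whose gaze agrees with head orientation, as a stand-in for
    the paper's eye-head coordination criterion (not its exact formulation)."""
    coord = torch.nn.functional.cosine_similarity(target, head_dir, dim=-1)  # in [-1, 1]
    weights = 1.0 + alpha * coord.clamp(min=0.0)     # coordinated samples count more
    return (weights * angular_error(pred, target)).mean()

loss = coordination_weighted_loss(torch.randn(4, 3), torch.randn(4, 3), torch.randn(4, 3))
```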
Gaze estimation, which predicts gaze direction, commonly faces interference from complex gaze-irrelevant information in face images. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits information from facial images in three aspects: gaze-relevant global features (disentangled from the facial image), local eye features (extracted from the cropped eye patch), and head pose estimation features, to improve overall performance. First, we design a new continuous mask-based Disentangler that accurately separates gaze-relevant and gaze-irrelevant information in facial images, achieving the dual-branch disentanglement goal by separately reconstructing the eye and non-eye regions. Furthermore, we introduce a new cascaded attention module named the Multi-Scale Global Local Attention Module (MS-GLAM). Through a customized cascaded attention structure, it effectively attends to global and local information at multiple scales, further enhancing the information from the Disentangler. Finally, the global gaze-relevant features disentangled by the upper face branch are combined with head pose and local eye features and passed through the detection head for high-precision gaze estimation. Our proposed DMAGaze has been extensively validated on two mainstream public datasets, achieving state-of-the-art performance.
https://arxiv.org/abs/2504.11160
Recent developments in hardware, computer graphics, and AI may soon enable AR/VR head-mounted displays (HMDs) to become everyday devices like smartphones and tablets. Eye trackers within HMDs provide a special opportunity for such setups, as they facilitate gaze-based research and interaction. However, estimating users' gaze information often requires raw eye images and videos that contain iris textures, which are considered a gold-standard biometric for user authentication, and this raises privacy concerns. Previous research in the eye-tracking community has focused on obfuscating iris textures while keeping utility tasks such as gaze estimation accurate. Despite these attempts, there is no comprehensive benchmark that evaluates state-of-the-art approaches. Considering all this, in this paper we benchmark blurring, noising, downsampling, the rubber sheet model, and iris style transfer for obfuscating user identity, and compare their impact on image quality, privacy, utility, and risk of imposter attack on two datasets. We use eye segmentation and gaze estimation as utility tasks, the reduction in iris recognition accuracy as a measure of privacy protection, and the false acceptance rate to estimate the risk of attack. Our experiments show that canonical image processing methods like blurring and noising have only a marginal impact on deep learning-based tasks. While downsampling, the rubber sheet model, and iris style transfer are all effective in hiding user identifiers, iris style transfer, at higher computation cost, outperforms the others on both utility tasks and is more resilient against spoof attacks. Our analyses indicate that there is no universal optimal approach to balance privacy, utility, and computation burden. Therefore, we recommend practitioners consider the strengths and weaknesses of each approach, and possible combinations thereof, to reach an optimal privacy-utility trade-off.
https://arxiv.org/abs/2504.10267
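The three canonical obfuscation baselines from the benchmark are easy to reproduce with OpenCV and NumPy. Kernel size, noise level, and downsampling factor below are arbitrary choices for illustration, not the settings used in the paper.

```python
import cv2
import numpy as np

def blur(eye_img: np.ndarray, ksize: int = 11) -> np.ndarray:
    """Gaussian blur of the eye crop."""
    return cv2.GaussianBlur(eye_img, (ksize, ksize), 0)

def add_noise(eye_img: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    """Additive Gaussian noise, clipped back to valid pixel range."""
    noisy = eye_img.astype(np.float32) + np.random.normal(0, sigma, eye_img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def downsample(eye_img: np.ndarray, factor: int = 4) -> np.ndarray:
    """Downsample then upsample back to the original resolution."""
    h, w = eye_img.shape[:2]
    small = cv2.resize(eye_img, (w // factor, h // factor), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)

img = np.random.randint(0, 256, (120, 160), dtype=np.uint8)   # stand-in eye crop
obfuscated = {"blur": blur(img), "noise": add_noise(img), "down": downsample(img)}
```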
This work aims to interpret human behavior to anticipate potential user confusion when a robot provides explanations for failure, allowing the robot to adapt its explanations for more natural and efficient collaboration. Using a dataset of facial emotion detection, eye gaze estimation, and gesture data from 55 participants in a user study, we analyzed how human behavior changed in response to different types of failures and varying explanation levels. Our goal is to assess whether human collaborators are ready to accept less detailed explanations without becoming confused. We formulate a data-driven predictor of human confusion during robot failure explanations. We also propose and evaluate a mechanism, based on this predictor, that adapts the explanation level according to observed human behavior. The promising results of this evaluation indicate the potential of this research for adapting a robot's failure explanations to enhance the collaborative experience.
https://arxiv.org/abs/2504.09717
Iris texture is widely regarded as a gold standard biometric modality for authentication and identification. The demand for robust iris recognition methods, coupled with growing security and privacy concerns regarding iris attacks, has escalated recently. Inspired by neural style transfer, an advanced technique that leverages neural networks to separate content and style features, we hypothesize that iris texture's style features provide a reliable foundation for recognition and are more resilient to variations like rotation and perspective shifts than traditional approaches. Our experimental results support this hypothesis, showing a significantly higher classification accuracy compared to conventional features. Further, we propose using neural style transfer to mask identifiable iris style features, ensuring the protection of sensitive biometric information while maintaining the utility of eye images for tasks like eye segmentation and gaze estimation. This work opens new avenues for iris-oriented, secure, and privacy-aware biometric systems.
https://arxiv.org/abs/2503.04707
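The hypothesis can be made concrete with the classic style statistic from neural style transfer: Gram matrices of convolutional feature maps. The sketch below extracts such a descriptor for an iris crop from a slice of an untrained VGG-16; the backbone, layer choice, and lack of pretrained weights are simplifications rather than the paper's configuration.

```python
import torch
import torchvision

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel correlation of a feature map: the classic 'style' statistic."""
    b, c, h, w = feats.shape
    f = feats.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

# Style features for an iris crop from the first few VGG-16 blocks.
vgg = torchvision.models.vgg16(weights=None).features[:16].eval()

def iris_style_features(iris_img: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        feats = vgg(iris_img)
    return gram_matrix(feats).flatten(1)    # one style descriptor per image

descriptor = iris_style_features(torch.randn(1, 3, 224, 224))
```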
Gaze estimation models are widely used in applications such as driver attention monitoring and human-computer interaction. While many methods for gaze estimation exist, they rely heavily on data-hungry deep learning to achieve high performance. This reliance often forces practitioners to harvest training data from unverified public datasets, outsource model training, or rely on pre-trained models. However, such practices expose gaze estimation models to backdoor attacks. In such attacks, adversaries inject backdoor triggers by poisoning the training data, creating a backdoor vulnerability: the model performs normally with benign inputs, but produces manipulated gaze directions when a specific trigger is present. This compromises the security of many gaze-based applications, such as causing the model to fail in tracking the driver's attention. To date, there is no defense that addresses backdoor attacks on gaze estimation models. In response, we introduce SecureGaze, the first solution designed to protect gaze estimation models from such attacks. Unlike classification models, defending gaze estimation poses unique challenges due to its continuous output space and globally activated backdoor behavior. By identifying distinctive characteristics of backdoored gaze estimation models, we develop a novel and effective approach to reverse-engineer the trigger function for reliable backdoor detection. Extensive evaluations in both digital and physical worlds demonstrate that SecureGaze effectively counters a range of backdoor attacks and outperforms seven state-of-the-art defenses adapted from classification models.
https://arxiv.org/abs/2502.20306
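For intuition only, the sketch below shows a generic trigger reverse-engineering loop for a gaze regressor: optimise an additive pattern that forces a fixed gaze output while penalising the pattern's size, and treat an unusually small successful pattern as evidence of a backdoor. The `model`, `images`, and `target_gaze` arguments are assumptions, and SecureGaze's actual detection procedure is more involved and is not reproduced here.

```python
import torch

def reverse_engineer_trigger(model, images, target_gaze, steps=200, lr=0.05, lam=1e-3):
    """Generic sketch: optimise an additive trigger pattern that pushes the regressor's
    output towards a fixed gaze for all inputs, with an L1 penalty on the pattern.
    Returns the pattern and its average magnitude for inspection."""
    trigger = torch.zeros_like(images[:1], requires_grad=True)
    opt = torch.optim.Adam([trigger], lr=lr)
    for _ in range(steps):
        pred = model((images + trigger).clamp(0, 1))
        loss = (pred - target_gaze).pow(2).mean() + lam * trigger.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return trigger.detach(), trigger.detach().abs().mean().item()
```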
Accurate 3D gaze estimation in unconstrained real-world environments remains a significant challenge due to variations in appearance, head pose, occlusion, and the limited availability of in-the-wild 3D gaze datasets. To address these challenges, we introduce a novel Self-Training Weakly-Supervised Gaze Estimation framework (ST-WSGE). This two-stage learning framework leverages diverse 2D gaze datasets, such as gaze-following data, which offer rich variations in appearances, natural scenes, and gaze distributions, and proposes an approach to generate 3D pseudo-labels and enhance model generalization. Furthermore, traditional modality-specific models, designed separately for images or videos, limit the effective use of available training data. To overcome this, we propose the Gaze Transformer (GaT), a modality-agnostic architecture capable of simultaneously learning static and dynamic gaze information from both image and video datasets. By combining 3D video datasets with 2D gaze target labels from gaze following tasks, our approach achieves the following key contributions: (i) Significant state-of-the-art improvements in within-domain and cross-domain generalization on unconstrained benchmarks like Gaze360 and GFIE, with notable cross-modal gains in video gaze estimation; (ii) Superior cross-domain performance on datasets such as MPIIFaceGaze and Gaze360 compared to frontal face methods. Code and pre-trained models will be released to the community.
https://arxiv.org/abs/2502.20249
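The basic geometry behind lifting a 2D gaze-following annotation to a 3D pseudo-label can be sketched as two pinhole back-projections followed by normalising the offset between them. The assumed availability of per-point depth and the intrinsics below are illustrative; the paper's pseudo-labelling pipeline is not reproduced here.

```python
import numpy as np

def backproject(pt_2d, depth, fx, fy, cx, cy):
    """Pinhole back-projection of an image point with known depth into camera space."""
    x = (pt_2d[0] - cx) * depth / fx
    y = (pt_2d[1] - cy) * depth / fy
    return np.array([x, y, depth])

def pseudo_3d_gaze(head_2d, head_depth, target_2d, target_depth, intrinsics):
    """Turn a 2D gaze-following annotation (head point -> gazed target point) into a
    unit 3D gaze direction, assuming depth estimates for both points are available."""
    head_3d = backproject(head_2d, head_depth, *intrinsics)
    target_3d = backproject(target_2d, target_depth, *intrinsics)
    direction = target_3d - head_3d
    return direction / np.linalg.norm(direction)

# intrinsics = (fx, fy, cx, cy); all numbers here are placeholders.
gaze_dir = pseudo_3d_gaze((320, 180), 1.2, (500, 260), 2.0, (600.0, 600.0, 320.0, 240.0))
```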
Complex application scenarios have raised critical requirements for precise and generalizable gaze estimation methods. Recently, the pre-trained CLIP has achieved remarkable performance on various vision tasks, but its potential has not been fully exploited in gaze estimation. In this paper, we propose a novel CLIP-driven Dual Feature Enhancing Network (CLIP-DFENet), which boosts gaze estimation performance with the help of CLIP under a novel 'main-side' collaborative enhancing strategy. A Language-driven Differential Module (LDM) is designed on top of CLIP's text encoder to reveal the semantic differences of gaze. This module empowers our Core Feature Extractor to characterize gaze-related semantic information. Moreover, a Vision-driven Fusion Module (VFM) is introduced to strengthen the generalized and valuable components of the visual embeddings obtained via CLIP's image encoder and uses them to further improve the generalization of the features captured by the Core Feature Extractor. Finally, a robust Double-head Gaze Regressor maps the enhanced features to gaze directions. Extensive experimental results on four challenging datasets over within-domain and cross-domain tasks demonstrate the discriminability and generalizability of our CLIP-DFENet.
https://arxiv.org/abs/2502.20128
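As a rough illustration of drawing gaze semantics from CLIP's text encoder, the sketch below embeds a few hypothetical gaze-describing prompts and forms their pairwise differences. The prompt wording and the use made of these differences are assumptions; the Language-driven Differential Module itself is not reproduced.

```python
import torch
import clip   # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

# Hypothetical gaze-describing prompts; the paper's actual prompt design differs.
prompts = ["a photo of a face gazing to the left",
           "a photo of a face gazing to the right",
           "a photo of a face gazing upward",
           "a photo of a face gazing downward"]

model, _ = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(prompts))          # (4, 512)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Pairwise differences of the normalised embeddings act as semantic "gaze difference"
# directions that a visual feature extractor could be trained to respect.
semantic_diffs = text_feats.unsqueeze(1) - text_feats.unsqueeze(0)   # (4, 4, 512)
```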
Egocentric video gaze estimation requires models to capture individual gaze patterns while adapting to diverse user data. Our approach, FedCPF, leverages a transformer-based architecture, integrating it into a personalized federated learning (PFL) framework where only the most significant parameters, those exhibiting the highest rate of change during training, are selected and frozen for personalization in the client models. Through extensive experimentation on the EGTEA Gaze+ and Ego4D datasets, we demonstrate that FedCPF significantly outperforms previously reported federated learning methods, achieving superior recall, precision, and F1-score. These results confirm the effectiveness of our comprehensive parameters freezing strategy in enhancing model personalization, making FedCPF a promising approach for tasks requiring both adaptability and accuracy in federated learning settings.
https://arxiv.org/abs/2502.18123
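A simplified take on the parameters-freezing idea: rank parameter tensors by how much they changed since the previous round and freeze the top fraction on the client. The change metric, per-tensor granularity, and fraction are assumptions, not FedCPF's exact selection rule.

```python
import torch
import torch.nn as nn

def freeze_most_changed(model: nn.Module, prev_state: dict, fraction: float = 0.2):
    """Rank parameter tensors by mean absolute change since the previous round and
    freeze the top fraction, so those weights are kept for personalisation while the
    rest continue federated training."""
    changes = {name: (p.detach() - prev_state[name]).abs().mean().item()
               for name, p in model.named_parameters()}
    k = max(1, int(len(changes) * fraction))
    frozen = sorted(changes, key=changes.get, reverse=True)[:k]
    for name, p in model.named_parameters():
        p.requires_grad = name not in frozen
    return frozen

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
prev = {n: p.detach().clone() for n, p in model.named_parameters()}
# ... one round of local training would update `model` here ...
frozen_names = freeze_most_changed(model, prev)
```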
Mobile gaze tracking involves inferring a user's gaze point or direction on a mobile device's screen from facial images captured by the device's front camera. While this technology inspires an increasing number of gaze-interaction applications, achieving consistent accuracy remains challenging due to dynamic user-device spatial relationships and varied motion conditions inherent in mobile contexts. This paper provides empirical evidence on how user mobility and behaviour affect mobile gaze tracking accuracy. We conduct two user studies collecting behaviour and gaze data under various motion conditions - from lying to maze navigation - and during different interaction tasks. Quantitative analysis has revealed behavioural regularities among daily tasks and identified head distance, head pose, and device orientation as key factors affecting accuracy, with errors increasing by up to 48.91% in dynamic conditions compared to static ones. These findings highlight the need for more robust, adaptive eye-tracking systems that account for head movements and device deflection to maintain accuracy across diverse mobile contexts.
https://arxiv.org/abs/2502.10570
3D and 2D gaze estimation share the fundamental objective of capturing eye movements but are traditionally treated as two distinct research domains. In this paper, we introduce a novel cross-task few-shot 2D gaze estimation approach, aiming to adapt a pre-trained 3D gaze estimation network for 2D gaze prediction on unseen devices using only a few training images. This task is highly challenging due to the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data. To address these challenges, we propose a novel framework that bridges the gap between 3D and 2D gaze. Our framework contains a physics-based differentiable projection module with learnable parameters to model screen poses and project 3D gaze into 2D gaze. The framework is fully differentiable and can be integrated into existing 3D gaze networks without modifying their original architecture. Additionally, we introduce a dynamic pseudo-labelling strategy for flipped images; generating flipped 2D labels is particularly challenging because the screen pose is unknown. To overcome this, we reverse the projection process by converting 2D labels to 3D space, where flipping is performed. Notably, this 3D space is not aligned with the camera coordinate system, so we learn a dynamic transformation matrix to compensate for this misalignment. We evaluate our method on the MPIIGaze, EVE, and GazeCapture datasets, collected respectively on laptops, desktop computers, and mobile devices. The superior performance highlights the effectiveness of our approach and demonstrates its strong potential for real-world applications.
https://arxiv.org/abs/2502.04074
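The physics-based projection can be sketched as a ray-plane intersection with a learnable screen pose (a 6D rotation representation plus a translation), returning the 2D point of gaze in screen coordinates. The parameterisation, units, and initial values below are assumptions; the paper's module may be defined differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableScreenProjection(nn.Module):
    """Projects a 3D gaze ray onto a screen plane with a learnable pose and returns
    2D screen coordinates. Assumes the ray is not parallel to the screen plane."""
    def __init__(self):
        super().__init__()
        self.rot6d = nn.Parameter(torch.tensor([1., 0., 0., 0., 1., 0.]))  # ~identity rotation
        self.trans = nn.Parameter(torch.tensor([0., 0., 300.0]))           # screen origin (mm)

    def screen_axes(self):
        a1, a2 = self.rot6d[:3], self.rot6d[3:]
        x = F.normalize(a1, dim=0)
        y = F.normalize(a2 - (a2 @ x) * x, dim=0)    # Gram-Schmidt orthogonalisation
        return x, y, torch.linalg.cross(x, y)        # in-plane axes and plane normal

    def forward(self, origin: torch.Tensor, gaze_dir: torch.Tensor) -> torch.Tensor:
        x_axis, y_axis, normal = self.screen_axes()
        # Ray-plane intersection: origin + s * gaze_dir lies on the screen plane.
        s = ((self.trans - origin) @ normal) / (gaze_dir @ normal)
        hit = origin + s.unsqueeze(-1) * gaze_dir
        rel = hit - self.trans
        return torch.stack([rel @ x_axis, rel @ y_axis], dim=-1)   # 2D PoG on the screen

pog = DifferentiableScreenProjection()(torch.zeros(4, 3), F.normalize(torch.randn(4, 3), dim=-1))
```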
Despite decades of research on data collection and model architectures, current gaze estimation models face significant challenges in generalizing across diverse data domains. While recent advances in self-supervised pre-training have shown remarkable potential for improving model generalization in various vision tasks, their effectiveness in gaze estimation remains unexplored due to the geometric nature of the gaze regression task. We propose UniGaze, which leverages large-scale, in-the-wild facial datasets through self-supervised pre-training for gaze estimation. We carefully curate multiple facial datasets that capture diverse variations in identity, lighting, background, and head poses. By directly applying Masked Autoencoder (MAE) pre-training on normalized face images with a Vision Transformer (ViT) backbone, our UniGaze learns appropriate feature representations within the specific input space required by downstream gaze estimation models. Through comprehensive experiments using challenging cross-dataset evaluation and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. The source code and pre-trained models will be released upon acceptance.
https://arxiv.org/abs/2502.02307
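The core mechanic of MAE pretraining, randomly masking most patch tokens so the encoder sees only a small visible subset, can be sketched in a few lines. The patch layout and masking ratio below follow common ViT-B defaults rather than UniGaze's exact configuration.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Core MAE step: keep a random subset of patch tokens for the encoder and
    remember which positions were masked so a decoder can reconstruct them.
    Shapes follow a ViT-style (batch, num_patches, dim) layout."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)               # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)                  # 1 = masked, 0 = visible
    return visible, mask, ids_keep

# A normalised 224x224 face split into 14x14 = 196 patches of dim 768 (ViT-B layout).
visible, mask, _ = random_masking(torch.randn(2, 196, 768))
```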