We introduce GazeD, a new 3D gaze estimation method that jointly provides 3D gaze and human pose from a single RGB image. Leveraging the ability of diffusion models to deal with uncertainty, it generates multiple plausible 3D gaze and pose hypotheses based on the 2D context information extracted from the input image. Specifically, we condition the denoising process on the 2D pose, the surroundings of the subject, and the context of the scene. With GazeD we also introduce a novel way of representing the 3D gaze by positioning it as an additional body joint at a fixed distance from the eyes. The rationale is that the gaze is usually closely related to the pose, and thus it can benefit from being jointly denoised during the diffusion process. Evaluations across three benchmark datasets demonstrate that GazeD achieves state-of-the-art performance in 3D gaze estimation, even surpassing methods that rely on temporal information. Project details will be available at this https URL.
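The gaze-as-joint idea above reduces to simple vector arithmetic: place a pseudo "gaze joint" at a fixed distance from the eyes along the gaze ray, so it can be denoised alongside the body joints. A minimal sketch — the function name, the eye-midpoint choice, and the default distance are our assumptions, not the paper's exact formulation:

```python
import numpy as np

def gaze_as_joint(left_eye, right_eye, gaze_dir, distance=1.0):
    """Represent 3D gaze as an extra body joint: a point at a fixed
    distance from the eye midpoint along the unit gaze direction.
    (Illustrative reconstruction; names and conventions are assumed.)"""
    eye_center = (np.asarray(left_eye, dtype=float) + np.asarray(right_eye, dtype=float)) / 2.0
    d = np.asarray(gaze_dir, dtype=float)
    d = d / np.linalg.norm(d)
    return eye_center + distance * d
```

Recovering the gaze direction from the denoised joint is the inverse operation: subtract the eye midpoint and normalize.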
https://arxiv.org/abs/2601.12948
We present a semantics-modulated, multi-scale Transformer for 3D gaze estimation. Our model conditions CLIP global features with learnable prototype banks (illumination, head pose, background, direction), fuses these prototype-enriched global vectors with CLIP patch tokens and high-resolution CNN tokens in a unified attention space, and replaces several FFN blocks with routed/shared Mixture-of-Experts layers to increase conditional capacity. Evaluated on MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze, our model achieves new state-of-the-art angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, demonstrating up to a 64% relative improvement over previously reported results. Ablations attribute the gains to prototype conditioning, cross-scale fusion, MoE, and hyperparameter choices. Our code is publicly available at https://github.com/AIPMLab/Gazeformer.
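The routed Mixture-of-Experts replacement for an FFN block can be sketched generically: a gating function scores the experts per token, and the top-k expert outputs are mixed by softmax weights. This is a minimal sketch of the general technique, not the paper's specific routing design:

```python
import numpy as np

def moe_ffn(x, experts, gate_w, top_k=2):
    """Generic routed MoE layer: score experts with a linear gate,
    keep the top-k, and mix their outputs by softmax weights.
    `experts` is a list of callables, `gate_w` a (d, n_experts) matrix."""
    logits = x @ gate_w                      # one score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the top-k experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # softmax over selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))
```

With `top_k=1` this degenerates to hard routing: the single highest-scoring expert processes the token.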
https://arxiv.org/abs/2601.12316
We introduce EyeTheia, a lightweight and open deep learning pipeline for webcam-based gaze estimation, designed for browser-based experimental platforms and real-world cognitive and clinical research. EyeTheia enables real-time gaze tracking using only a standard laptop webcam, combining MediaPipe-based landmark extraction with a convolutional neural network inspired by iTracker and optional user-specific fine-tuning. We investigate two complementary strategies: adapting a model pretrained on mobile data and training the same architecture from scratch on a desktop-oriented dataset. Validation results on MPIIFaceGaze show comparable performance between both approaches prior to calibration, while lightweight user-specific fine-tuning consistently reduces gaze prediction error. We further evaluate EyeTheia in a realistic Dot-Probe task and compare it to the commercial webcam-based tracker SeeSo SDK. Results indicate strong agreement in left-right gaze allocation during stimulus presentation, despite higher temporal variability. Overall, EyeTheia provides a transparent and extensible solution for low-cost gaze tracking, suitable for scalable and reproducible experimental and clinical studies. The code, trained models, and experimental materials are publicly available.
https://arxiv.org/abs/2601.06279
Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited not only by the feature representations within a frame but also by those between frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing intra-frame context to be captured and then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into the model's performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards more robust video-based gaze estimation using commonly available cameras.
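The channel-attention component used to fuse eye and face features is commonly implemented in the squeeze-and-excitation style: pool each channel, pass through a small MLP, and gate the channels with a sigmoid. A minimal sketch under that assumption (the paper's exact module may differ):

```python
import numpy as np

def channel_attention(feat, w1, b1, w2, b2):
    """Squeeze-and-excitation-style channel attention over a (C, H, W)
    feature map: global-average-pool per channel, two-layer MLP,
    sigmoid gate, then channel-wise rescaling."""
    s = feat.mean(axis=(1, 2))                  # squeeze: (C,)
    h = np.maximum(0.0, w1 @ s + b1)            # excitation MLP, ReLU
    g = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))    # sigmoid gate: (C,)
    return feat * g[:, None, None]              # reweight channels
```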
https://arxiv.org/abs/2512.17673
Appearance-based gaze estimation, aiming to predict accurate 3D gaze direction from a single facial image, has made promising progress in recent years. However, most methods suffer significant performance degradation in cross-domain evaluation due to interference from gaze-irrelevant factors, such as expressions, wearables, and image quality. To alleviate this problem, we present a novel Hybrid-domain Adaptive Representation Learning (HARL for short) framework that exploits multi-source hybrid datasets to learn robust gaze representations. More specifically, we propose to disentangle gaze-relevant representations from low-quality facial images by aligning them with features extracted from high-quality near-eye images in an unsupervised domain-adaptation manner, which adds almost no computational or inference cost. Additionally, we analyze the effect of head pose and design a simple yet efficient sparse graph fusion module to exploit the geometric constraint between gaze direction and head pose, leading to a dense and robust gaze representation. Extensive experiments on EyeDiap, MPIIFaceGaze, and Gaze360 datasets demonstrate that our approach achieves state-of-the-art accuracies of $\textbf{5.02}^{\circ}$, $\textbf{3.36}^{\circ}$, and $\textbf{9.26}^{\circ}$ respectively, and presents competitive performance in cross-dataset evaluation. The code is available at this https URL.
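The angular errors reported throughout these abstracts use the standard metric: the angle between the predicted and ground-truth 3D gaze vectors. A straightforward implementation:

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Angle in degrees between predicted and ground-truth 3D gaze
    vectors; the standard accuracy metric for 3D gaze estimation."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    cos = np.dot(pred / np.linalg.norm(pred), gt / np.linalg.norm(gt))
    cos = np.clip(cos, -1.0, 1.0)  # guard against floating-point overshoot
    return float(np.degrees(np.arccos(cos)))
```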
https://arxiv.org/abs/2511.13222
Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets, and ii) the insufficient diversity of labeled data. In this work, we present OmniGaze, a semi-supervised framework for 3D gaze estimation, which utilizes large-scale unlabeled data collected from diverse and unconstrained real-world environments to mitigate domain bias and generalize gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images, varying in facial appearances, background environments, illumination conditions, head poses, and eye occlusions. In order to leverage unlabeled data spanning a broader distribution, OmniGaze adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond pseudo labels as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and semantic cues from gaze perspective generated by prompting a Multimodal Large Language Model to compute confidence scores. Then, these scores are utilized to select high-quality pseudo labels and weight them for loss computation. Extensive experiments demonstrate that OmniGaze achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we also evaluate the efficacy of OmniGaze as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.
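The reward-model scores drive two things: selecting high-quality pseudo labels and weighting their contribution to the loss. A hedged sketch of that step (the threshold, the distance-based per-sample error, and the normalization are illustrative assumptions, not OmniGaze's exact loss):

```python
import numpy as np

def weighted_pseudo_loss(pred, pseudo, conf, threshold=0.5):
    """Keep pseudo labels whose confidence clears a threshold, then
    weight each retained sample's error by that confidence.
    (Illustrative sketch; the actual loss in the paper may differ.)"""
    pred, pseudo, conf = map(np.asarray, (pred, pseudo, conf))
    keep = conf >= threshold
    if not keep.any():
        return 0.0
    err = np.linalg.norm(pred[keep] - pseudo[keep], axis=1)  # per-sample error
    w = conf[keep]
    return float((w * err).sum() / w.sum())                  # confidence-weighted mean
```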
https://arxiv.org/abs/2510.13660
This paper evaluates current gaze estimation methods in an HRI context: a shared-workspace scenario. We introduce a new, annotated dataset collected with the NICO robotic platform and evaluate four state-of-the-art gaze estimation models. The evaluation shows that the angular errors are close to those reported on general-purpose benchmarks. However, when expressed as distance in the shared workspace, the best median error is 16.48 cm, quantifying the practical limitations of current methods. We conclude by discussing these limitations and offering recommendations on how to best integrate gaze estimation as a modality in HRI systems.
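The gap between modest angular errors and large workspace errors follows from simple geometry: at viewing distance d, an angular error θ projects to roughly d·tan(θ) on the work surface. A sketch of that conversion (the viewing distance in the example is our assumption, not the paper's setup):

```python
import math

def angular_to_surface_error(angular_err_deg, viewing_distance_m):
    """Approximate positional error on a surface at a given viewing
    distance for a given gaze angular error: e ≈ d * tan(theta)."""
    return viewing_distance_m * math.tan(math.radians(angular_err_deg))
```

For instance, at a 1 m viewing distance a 9.4° angular error already maps to about 16.5 cm on the table, which is the scale of workspace error the paper reports.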
https://arxiv.org/abs/2509.24001
We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06) and human-robot interaction scenarios in RT-GENE (4.76), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results for this article can be found on: this https URL
https://arxiv.org/abs/2509.19936
This paper examines the key factors that influence the performance of state-of-the-art gaze-based authentication. Experiments were conducted on a large-scale, in-house dataset comprising 8,849 subjects, collected with Meta Quest Pro-equivalent hardware running a video-oculography-driven gaze estimation pipeline at 72 Hz. A state-of-the-art neural network architecture was employed to study the influence of the following factors on authentication performance: eye tracking signal quality, various aspects of eye tracking calibration, and simple filtering of the estimated raw gaze. We found that using the same calibration target depth for eye tracking calibration, fusing calibrated and non-calibrated gaze, and improving eye tracking signal quality all enhance authentication performance. We also found that a simple three-sample moving average filter slightly reduces authentication performance in general. While these findings hold true for the most part, some exceptions were noted.
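The three-sample moving average examined here is a plain box filter over the gaze signal; a minimal version (keeping the endpoints unfiltered is our assumption about boundary handling):

```python
def moving_average3(signal):
    """Three-sample moving average over a 1-D gaze signal.
    Endpoints are left unfiltered (an assumed boundary choice)."""
    out = list(signal)
    for i in range(1, len(signal) - 1):
        out[i] = (signal[i - 1] + signal[i] + signal[i + 1]) / 3.0
    return out
```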
https://arxiv.org/abs/2509.10969
The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in a unified coordinate frame, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings.
https://arxiv.org/abs/2509.07447
With advancements in AI, new gaze estimation methods are exceeding state-of-the-art (SOTA) benchmarks, but their real-world application reveals a gap with commercial eye-tracking solutions. Factors like model size, inference time, and privacy often go unaddressed. Meanwhile, webcam-based eye-tracking methods lack sufficient accuracy, in particular due to head movement. To tackle these issues, we introduce WebEyeTrack, a framework that integrates lightweight SOTA gaze estimation models directly in the browser. It incorporates model-based head pose estimation and on-device few-shot learning with as few as nine calibration samples (k < 9). WebEyeTrack adapts to new users, achieving SOTA performance with an error margin of 2.32 cm on GazeCapture and real-time inference speeds of 2.4 milliseconds on an iPhone 14. Our open-source code is available at this https URL.
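Few-shot, on-device calibration from a handful of samples is often implemented as a least-squares correction layered on top of raw model predictions. The affine fit below is a generic sketch of that idea, not WebEyeTrack's actual adaptation scheme:

```python
import numpy as np

def fit_calibration(pred_xy, true_xy):
    """Least-squares affine correction from a few calibration samples:
    maps raw (x, y) predictions to on-screen targets.
    (Generic sketch; WebEyeTrack's adaptation may differ.)"""
    pred_xy = np.asarray(pred_xy, dtype=float)
    P = np.hstack([pred_xy, np.ones((len(pred_xy), 1))])  # homogeneous coords
    A, *_ = np.linalg.lstsq(P, np.asarray(true_xy, dtype=float), rcond=None)
    return A  # (3, 2); apply as np.hstack([xy, [1]]) @ A
```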
https://arxiv.org/abs/2508.19544
Although appearance-based point-of-gaze (PoG) estimation has improved, the estimators still struggle to generalize across individuals due to personal differences. Therefore, person-specific calibration is required for accurate PoG estimation. However, calibrated PoG estimators are often sensitive to head pose variations. To address this, we investigate the key factors influencing calibrated estimators and explore pose-robust calibration strategies. Specifically, we first construct a benchmark, MobilePoG, which includes facial images from 32 individuals focusing on designated points under either fixed or continuously changing head poses. Using this benchmark, we systematically analyze how the diversity of calibration points and head poses influences estimation accuracy. Our experiments show that introducing a wider range of head poses during calibration improves the estimator's ability to handle pose variation. Building on this insight, we propose a dynamic calibration strategy in which users fixate on calibration points while moving their phones. This strategy naturally introduces head pose variation during a user-friendly and efficient calibration process, ultimately producing a better calibrated PoG estimator that is less sensitive to head pose variations than those using conventional calibration strategies. Codes and datasets are available at our project page.
https://arxiv.org/abs/2508.10268
We propose a novel 3D gaze redirection framework that leverages an explicit 3D eyeball structure. Existing gaze redirection methods are typically based on neural radiance fields, which employ implicit neural representations via volume rendering. Unlike these NeRF-based approaches, where the rotation and translation of 3D representations are not explicitly modeled, we introduce a dedicated 3D eyeball structure to represent the eyeballs with 3D Gaussian Splatting (3DGS). Our method generates photorealistic images that faithfully reproduce the desired gaze direction by explicitly rotating and translating the 3D eyeball structure. In addition, we propose an adaptive deformation module that enables the replication of subtle muscle movements around the eyes. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our framework is capable of generating diverse novel gaze images, achieving superior image quality and gaze estimation accuracy compared to previous state-of-the-art methods.
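Explicit gaze redirection with a dedicated eyeball structure amounts to rigidly rotating the eyeball's point set (e.g. Gaussian means in a 3DGS representation) about its centre. A minimal sketch; the axis conventions and yaw/pitch parameterization are our assumptions:

```python
import numpy as np

def rotate_eyeball(points, center, yaw_deg, pitch_deg):
    """Rigidly rotate 3D points (e.g. Gaussian means on an eyeball)
    about the eyeball centre: yaw about y, then pitch about x.
    (Axis conventions are assumed for illustration.)"""
    y, p = np.radians([yaw_deg, pitch_deg])
    Ry = np.array([[np.cos(y), 0.0, np.sin(y)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(y), 0.0, np.cos(y)]])
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(p), -np.sin(p)],
                   [0.0, np.sin(p), np.cos(p)]])
    return (points - center) @ (Rx @ Ry).T + center
```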
https://arxiv.org/abs/2508.06136
This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability, addressing limitations of existing single-view methods, which face challenges such as face occlusion, target ambiguity, and out-of-view targets. Our method processes a pair of camera views as input, incorporating a Head Information Aggregation (HIA) module that leverages head information from both views for more accurate gaze estimation, an Uncertainty-based Gaze Selection (UGS) module that identifies the most reliable gaze output, and an Epipolar-based Scene Attention (ESA) module for cross-view background information sharing. This approach significantly outperforms single-view baselines, especially when the second camera provides a clear view of the person's face. Additionally, our method can estimate the gaze target in the first view using only the image of the person in the second view, a capability not possessed by single-view GTE methods. Furthermore, the paper introduces a multi-view dataset for developing and evaluating multi-view GTE methods. Data and code are available at this https URL
https://arxiv.org/abs/2508.05857
Human-machine interaction through augmented reality (AR) and virtual reality (VR) is increasingly prevalent, requiring accurate and efficient gaze estimation, which in turn hinges on accurate eye segmentation for smooth user experiences. We introduce EyeSeg, a novel eye segmentation framework designed to overcome key challenges that existing approaches struggle with: motion blur, eyelid occlusion, and train-test domain gaps. In these situations, existing models struggle to extract robust features, leading to suboptimal performance. Noting that these challenges can generally be quantified as uncertainty, we design EyeSeg as an uncertainty-aware eye segmentation framework for AR/VR, in which we explicitly model the uncertainties by performing Bayesian uncertainty learning of a posterior under a closed-set prior. Theoretically, we prove that a statistic of the learned posterior indicates the segmentation uncertainty level and empirically benefits downstream tasks such as gaze estimation. EyeSeg outputs an uncertainty score together with the segmentation result, weighting and fusing multiple gaze estimates for robustness, which proves effective especially under motion blur, eyelid occlusion, and cross-domain challenges. Moreover, empirical results show that EyeSeg achieves segmentation improvements in MIoU, E1, F1, and ACC, surpassing previous approaches. The code is publicly available at this https URL.
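Uncertainty-weighted fusion of several gaze estimates can be sketched with inverse-uncertainty weights followed by renormalization to a unit direction. The weighting form below is a common choice, assumed for illustration rather than taken from the paper:

```python
import numpy as np

def fuse_gaze_estimates(gazes, uncertainties):
    """Fuse several 3D gaze estimates with inverse-uncertainty weights,
    then renormalize the result to a unit direction.
    (Weighting scheme is a generic assumption.)"""
    w = 1.0 / (np.asarray(uncertainties, dtype=float) + 1e-8)  # avoid /0
    fused = (w[:, None] * np.asarray(gazes, dtype=float)).sum(axis=0) / w.sum()
    return fused / np.linalg.norm(fused)
```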
https://arxiv.org/abs/2507.09649
Ophthalmic surgical robots offer superior stability and precision by reducing the natural hand tremors of human surgeons, enabling delicate operations in confined surgical spaces. Despite the advancements in developing vision- and force-based control methods for surgical robots, preoperative navigation remains heavily reliant on manual operation, limiting consistency and increasing uncertainty. Existing eye gaze estimation techniques in surgery, whether traditional or deep learning-based, face challenges including dependence on additional sensors, occlusion in surgical environments, and the requirement for facial detection. To address these limitations, this study proposes an innovative eye localization and tracking method that combines machine learning with traditional algorithms, eliminating the need for facial landmarks while maintaining stable iris detection and gaze estimation under varying lighting and shadow conditions. Extensive real-world experiment results show that our proposed method has an average estimation error of 0.58 degrees for eye orientation estimation and a 2.08-degree average control error for the robotic arm's movement based on the calculated orientation.
https://arxiv.org/abs/2507.00635
Enabling robots to understand human gaze targets is a crucial step toward downstream capabilities such as attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict the gaze target when subjects look away from the camera. In this work, we propose a system for 360-degree gaze target estimation from a single image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines: an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross-validation results show that GazeTarget360 produces accurate and reliable gaze target predictions in unseen scenarios, making it a first-of-its-kind system that predicts gaze targets from realistic camera footage while remaining highly efficient and deployable. Our source code is made publicly available at: this https URL.
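The conditional-inference structure is easy to sketch: run the eye-contact detector first, and only invoke the encoder-decoder when the subject is not looking at the camera. Function names and the string sentinel below are illustrative stand-ins, not the paper's API:

```python
def gaze_target_360(frame, eye_contact_detector, target_decoder):
    """Conditional inference: if the subject makes eye contact, the
    camera itself is the gaze target; otherwise delegate to the
    encoder-decoder to localize the target in the scene.
    (Names and return values are illustrative.)"""
    if eye_contact_detector(frame):
        return "camera"
    return target_decoder(frame)
```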
https://arxiv.org/abs/2507.00253
Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts, critical features for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework designed to enhance the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories based on velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments.
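A temporal-smoothness score in the spirit of the proposed Jitter Metric can be sketched from second differences of the predicted trajectory: a smooth gaze track has small, regular accelerations. The paper's full formulation uses velocity regularity and local signal complexity; this reduction to mean absolute acceleration is our simplification:

```python
import numpy as np

def jitter_metric(traj):
    """Simple temporal-smoothness proxy: mean magnitude of the
    frame-to-frame acceleration (second difference) of a gaze
    trajectory. Lower is smoother. (Simplified stand-in for the
    paper's velocity-regularity/complexity-based metric.)"""
    traj = np.asarray(traj, dtype=float)
    accel = np.diff(traj, n=2, axis=0)
    return float(np.abs(accel).mean())
```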
https://arxiv.org/abs/2506.12524
This study evaluates a smartphone-based, deep-learning eye-tracking algorithm by comparing its performance against a commercial infrared-based eye tracker, the Tobii Pro Nano. The aim is to investigate the feasibility of appearance-based gaze estimation under realistic mobile usage conditions. Key sensitivity factors, including age, gender, vision correction, lighting conditions, device type, and head position, were systematically analysed. The appearance-based algorithm integrates a lightweight convolutional neural network (MobileNet-V3) with a recurrent structure (Long Short-Term Memory) to predict gaze coordinates from grayscale facial images. Gaze data were collected from 51 participants using dynamic visual stimuli, and accuracy was measured using Euclidean distance. The deep learning model produced a mean error of 17.76 mm, compared to 16.53 mm for the Tobii Pro Nano. While overall accuracy differences were small, the deep learning-based method was more sensitive to factors such as lighting, vision correction, and age, with higher failure rates observed under low-light conditions among participants using glasses and in older age groups. Device-specific and positional factors also influenced tracking performance. These results highlight the potential of appearance-based approaches for mobile eye tracking and offer a reference framework for evaluating gaze estimation systems across varied usage conditions.
https://arxiv.org/abs/2506.11932
This report presents our solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2025. Egocentric video captures the scene from the wearer's perspective, where gaze serves as a key non-verbal communication cue that reflects visual attention and offers insights into human intention and cognition. Motivated by this, we propose a novel approach, GazeNLQ, which leverages gaze to retrieve video segments that match given natural language queries. Specifically, we introduce a contrastive learning-based pretraining strategy for gaze estimation directly from video. The estimated gaze is used to augment video representations within the proposed model, thereby enhancing localization accuracy. Experimental results show that GazeNLQ achieves R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively. Our code is available at this https URL.
https://arxiv.org/abs/2506.05782