While appearance-based gaze estimation has achieved significant improvements in accuracy and domain adaptation, the fairness of these systems across different demographic groups remains largely unexplored. To date, there is no comprehensive benchmark quantifying algorithmic bias in gaze estimation. This paper presents the first extensive evaluation of fairness in appearance-based gaze estimation, focusing on ethnicity and gender attributes. We establish a fairness baseline by analyzing state-of-the-art models using standard fairness metrics, revealing significant performance disparities. Furthermore, we evaluate the effectiveness of existing bias mitigation strategies when applied to the gaze domain and show that their fairness contributions are limited. We summarize key insights and open issues. Overall, our work calls for research into developing robust, equitable gaze estimators. To support future research and reproducibility, we publicly release our annotations, code, and trained models at: this http URL
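To make the evaluation concrete, here is a minimal sketch (not the authors' released code) of how per-group angular error and a simple max-gap disparity measure could be computed; the `angular_error` helper, the group labels, and the toy data are assumptions for illustration only.

```python
import numpy as np

def angular_error(pred, gt):
    """Angular error in degrees between predicted and ground-truth 3D gaze vectors."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def group_disparity(pred, gt, groups):
    """Mean angular error per demographic group and the max-min gap, one common disparity measure."""
    errors = angular_error(pred, gt)
    per_group = {g: errors[groups == g].mean() for g in np.unique(groups)}
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap

# Toy example with random data standing in for model outputs and annotations.
rng = np.random.default_rng(0)
pred = rng.normal(size=(100, 3))
gt = rng.normal(size=(100, 3))
groups = rng.choice(["group_a", "group_b"], size=100)
print(group_disparity(pred, gt, groups))
```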
https://arxiv.org/abs/2604.10707
Eye tracking (ET) plays a critical role in augmented and virtual reality applications. However, rapidly deploying high-accuracy, on-device gaze estimation for new products remains challenging because hardware configurations (e.g., camera placement, camera pose, and illumination) often change across device generations. Visual foundation models (VFMs) are a promising direction for rapid training and deployment, and they excel on natural-image benchmarks; yet we find that off-the-shelf VFMs still struggle to achieve high accuracy on specialized near-eye infrared imagery. To address this gap, we introduce DistillGaze, a framework that distills a foundation model by leveraging labeled synthetic data and unlabeled real data for rapid and high-performance on-device gaze estimation. DistillGaze proceeds in two stages. First, we adapt a VFM into a domain-specialized teacher using self-supervised learning on labeled synthetic and unlabeled real images. Synthetic data provides scalable, high-quality gaze supervision, while unlabeled real data helps bridge the synthetic-to-real domain gap. Second, we train an on-device student using both teacher guidance and self-training. Evaluated on a large-scale, crowd-sourced dataset spanning over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while maintaining a lightweight 256K-parameter model suitable for real-time on-device deployment. Overall, DistillGaze provides an efficient pathway for training and deploying ET models that adapt to hardware changes, and offers a recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks.
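As a rough illustration of the second stage described above (student training from teacher guidance plus synthetic supervision), the following hedged sketch combines an L1 loss on labeled synthetic frames with a distillation loss against frozen-teacher pseudo-labels on unlabeled real frames; the loss weights, the L1 choice, and the toy models are assumptions, not DistillGaze's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def student_step(student, teacher, synth_x, synth_y, real_x, w_distill=1.0):
    """Combine supervised loss on labeled synthetic data with a distillation loss on unlabeled real data."""
    loss_synth = F.l1_loss(student(synth_x), synth_y)   # synthetic frames carry gaze labels
    with torch.no_grad():
        pseudo = teacher(real_x)                        # frozen teacher pseudo-labels real frames
    loss_distill = F.l1_loss(student(real_x), pseudo)
    return loss_synth + w_distill * loss_distill

# Toy stand-ins: tiny linear models over flattened eye crops, 2D (pitch, yaw) output.
teacher, student = nn.Linear(64, 2), nn.Linear(64, 2)
loss = student_step(student, teacher, torch.randn(8, 64), torch.randn(8, 2), torch.randn(8, 64))
loss.backward()
```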
https://arxiv.org/abs/2604.02509
Appearance-based gaze estimation (AGE) has achieved remarkable performance in constrained settings, yet we reveal a significant generalization gap where existing AGE models often fail in practical, unconstrained scenarios, particularly those involving facial wearables and poor lighting conditions. We attribute this failure to two core factors: limited image diversity and inconsistent label fidelity across different datasets, especially along the pitch axis. To address these, we propose a robust AGE framework that enhances generalization without requiring additional human-annotated data. First, we expand the image manifold via an ensemble of augmentation techniques, including synthesis of eyeglasses, masks, and varied lighting. Second, to mitigate the impact of anisotropic inter-dataset label deviation, we reformulate gaze regression as a multi-task learning problem, incorporating multi-view supervised contrastive (SupCon) learning, discretized label classification, and eye-region segmentation as auxiliary objectives. To rigorously validate our approach, we curate new benchmark datasets designed to evaluate gaze robustness under challenging conditions, a dimension largely overlooked by existing evaluation protocols. Our MobileNet-based lightweight model achieves generalization performance competitive with the state-of-the-art (SOTA) UniGaze-H, while utilizing less than 1% of its parameters, enabling high-fidelity, real-time gaze tracking on mobile devices.
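A minimal sketch of the multi-task reformulation, assuming a shared feature vector feeds both a continuous (pitch, yaw) regressor and an auxiliary discretized-label classifier; the bin width, loss weights, and head layout are illustrative, and the SupCon and eye-region segmentation objectives are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskGazeHead(nn.Module):
    """Shared features feed a continuous (pitch, yaw) regressor and a coarse per-angle bin classifier."""
    def __init__(self, feat_dim=128, n_bins=18, bin_width_deg=10.0):
        super().__init__()
        self.reg = nn.Linear(feat_dim, 2)
        self.cls = nn.Linear(feat_dim, 2 * n_bins)
        self.n_bins, self.bin_width = n_bins, bin_width_deg

    def loss(self, feats, gaze_deg, w_cls=0.5):
        loss_reg = F.l1_loss(self.reg(feats), gaze_deg)
        # Discretize each angle into bins centred on 0 deg for the auxiliary classification loss.
        bins = torch.clamp((gaze_deg / self.bin_width + self.n_bins / 2).long(), 0, self.n_bins - 1)
        logits = self.cls(feats).view(-1, 2, self.n_bins)
        loss_cls = F.cross_entropy(logits.flatten(0, 1), bins.flatten())
        return loss_reg + w_cls * loss_cls

head = MultiTaskGazeHead()
print(head.loss(torch.randn(4, 128), torch.rand(4, 2) * 60 - 30))
```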
https://arxiv.org/abs/2603.26945
Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at this https URL.
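A hedged sketch of the landmark-based setup: normalized 2D landmarks flattened into a feature vector and fed to a lightweight gradient-boosted-tree regressor (scikit-learn here as a stand-in for the boosted trees used in the paper); the landmark count, normalization, and toy data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Toy data: 68 2D landmarks per face, flattened and normalized, regressed to (pitch, yaw) in radians.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 68 * 2))          # stand-in for normalized landmark-detector output
y = rng.uniform(-0.5, 0.5, size=(500, 2))   # stand-in for gaze labels

model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100, max_depth=3))
model.fit(X[:400], y[:400])
pred = model.predict(X[400:])
print("mean abs error (rad):", np.abs(pred - y[400:]).mean())
```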
https://arxiv.org/abs/2603.24724
View transformers process multi-view observations to predict actions and have shown impressive performance in robotic manipulation. Existing methods typically extract static visual representations in a view-specific manner, leading to inadequate 3D spatial reasoning ability and a lack of dynamic adaptation. Taking inspiration from how the human brain integrates static and dynamic views to address these challenges, we propose Cortical Policy, a novel dual-stream view transformer for robotic manipulation that jointly reasons from static-view and dynamic-view streams. The static-view stream enhances spatial understanding by aligning features of geometrically consistent keypoints extracted from a pretrained 3D foundation model. The dynamic-view stream achieves adaptive adjustment through position-aware pretraining of an egocentric gaze estimation model, computationally replicating the human cortical dorsal pathway. Subsequently, the complementary view representations of both streams are integrated to determine the final actions, enabling the model to handle spatially-complex and dynamically-changing tasks under language conditions. Empirical evaluations on RLBench, the challenging COLOSSEUM benchmark, and real-world tasks demonstrate that Cortical Policy outperforms state-of-the-art baselines substantially, validating the superiority of dual-stream design for visuomotor control. Our cortex-inspired framework offers a fresh perspective for robotic manipulation and holds potential for broader application in vision-based robot control.
https://arxiv.org/abs/2603.21051
Deep learning-based appearance gaze estimation methods are gaining popularity due to their high accuracy and fewer environmental constraints. However, existing high-precision models often rely on deeper networks, leading to problems such as large parameter counts, long training times, and slow convergence. To address this issue, this paper proposes FGI-Net (Fusion Global Information), a novel lightweight gaze estimation model. The model fuses global information into the CNN, removing the need to capture global context indirectly through many convolution and pooling layers, while reducing model complexity and improving accuracy and convergence speed. To validate the model, extensive experiments compare its accuracy against existing classical and lightweight models, compare its convergence speed against models of different architectures, and include ablation studies. Experimental results show that, compared with GazeCaps, the latest gaze estimation model, FGI-Net achieves smaller angular errors (3.74° on MPIIFaceGaze, 5.15° on EyeDiap, 10.50° on Gaze360, and 6.02° on RT-Gene) while reducing parameters and FLOPs by 87.1% and 79.1%, respectively. Moreover, compared with models of different architectures such as CNNs and Transformers, FGI-Net converges quickly to a higher accuracy range with fewer training iterations: to reach optimal accuracy on the Gaze360 and EyeDiap datasets, FGI-Net requires 25% and 37.5% fewer training iterations than GazeTR, respectively.
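The abstract does not spell out the fusion mechanism, so the following is only a hedged, squeeze-and-excitation-style illustration of the general idea of injecting a globally pooled descriptor back into local CNN features; the block structure and dimensions are assumptions, not FGI-Net's architecture.

```python
import torch
import torch.nn as nn

class GlobalFusionBlock(nn.Module):
    """Convolutional features reweighted by a globally pooled descriptor (squeeze-and-excitation style)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                    # global context in one step, no deep stack needed
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        feats = self.local(x)
        return feats * self.gate(feats)                 # broadcast channel-wise global weights

print(GlobalFusionBlock(32)(torch.randn(2, 32, 28, 28)).shape)
```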
https://arxiv.org/abs/2411.18064
Online dating has become the dominant way romantic relationships begin, yet current platforms strip away the nonverbal cues (gaze, facial expression, body posture, response timing) that humans rely on to signal comfort, disinterest, and consent, creating a communication gap with disproportionate safety consequences for women. We argue that this gap represents both a technical opportunity and a moral responsibility for the computer vision community, which has developed the affective tools needed to begin addressing it (facial action unit detection, gaze estimation, engagement modeling, and multimodal affect recognition), yet has largely ignored the dating domain as a research context. We propose a fairness-first research agenda organized around four capability areas: real-time discomfort detection, engagement asymmetry modeling between partners, consent-aware interaction design, and longitudinal interaction summarization, each grounded in established CV methodology and motivated by the social psychology of romantic communication. We argue that responsible pursuit of this agenda requires purpose-built datasets collected under dyadic consent protocols, fairness evaluation disaggregated across race, gender identity, neurotype, and cultural background, and architectural commitments to on-device processing that prevent affective data from becoming platform surveillance infrastructure. This vision paper calls on the WICV community, whose members are uniquely positioned to understand both the technical opportunity and the human stakes, to establish online dating safety as a first-class research domain before commercial deployment outpaces ethical deliberation.
https://arxiv.org/abs/2603.26727
We present a new and accurate approach for gaze estimation on consumer computing devices. We take advantage of continued improvements in the quality of user-facing cameras found in smartphones, laptops, and desktops (4K or greater in high-end devices), such that it is now possible to capture the 2D reflection of a device's screen in the user's eyes. This alone is insufficient for accurate gaze tracking due to the near-infinite variety of screen content. Crucially, however, the device knows what is being displayed on its own screen; in this work, we show this information allows for robust segmentation of the reflection, the location and size of which encodes the user's screen-relative gaze target. We explore several strategies to leverage this useful signal, quantifying performance in a user study. Our best performing model reduces mean tracking error by ~8% compared to a baseline appearance-based model. A supplemental study reveals an additional 10-20% improvement if the gaze-tracking camera is located at the bottom of the device.
https://arxiv.org/abs/2603.19588
We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios. Project page: this https URL.
https://arxiv.org/abs/2603.17161
In human-robot interaction (HRI), detecting a human's gaze helps robots interpret user attention and intent. However, most gaze detection approaches rely on specialized eye-tracking hardware, limiting deployment in everyday settings. Appearance-based gaze estimation methods remove this dependency by using standard RGB cameras, but their practicality in HRI remains underexplored. We present a calibration-free framework for detecting task progression when information is conveyed via integrated display interfaces. The framework uses only the robot's built-in monocular RGB camera (640x480 resolution) and state-of-the-art gaze estimation to monitor attention patterns. It leverages natural behavior, where users shift focus from task interfaces to the robot's face to signal task completion, formalized through three Areas of Interest (AOI): tablet, robot face, and elsewhere. Systematic parameter optimization identifies configurations that balance detection accuracy and interaction latency. We validate our framework in a "First Day at Work" scenario, comparing it to button-based interaction. Results show a task completion detection accuracy of 77.6%. Compared to button-based interaction, the proposed system exhibits slightly higher response latency but preserves information retention and significantly improves comfort, social presence, and perceived naturalness. Notably, most participants reported that they did not consciously use eye movements to guide the interaction, underscoring the intuitive role of gaze as a communicative cue. This work demonstrates the feasibility of intuitive, low-cost, RGB-only gaze-based HRI for natural and engaging interactions.
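A minimal sketch of the AOI logic described above (tablet, robot face, elsewhere) with a dwell threshold before task completion is signalled; the angular AOI boxes, dwell length, and gaze stream format are illustrative assumptions.

```python
def classify_aoi(yaw_deg, pitch_deg, tablet_box=(-30, 0, -40, -10), face_box=(-10, 10, -5, 15)):
    """Map a gaze direction to an Area of Interest using fixed angular boxes (yaw_min, yaw_max, pitch_min, pitch_max)."""
    for name, (y0, y1, p0, p1) in (("tablet", tablet_box), ("robot_face", face_box)):
        if y0 <= yaw_deg <= y1 and p0 <= pitch_deg <= p1:
            return name
    return "elsewhere"

def detect_completion(gaze_stream, min_dwell_frames=15):
    """Signal task completion once gaze has stayed on the robot's face for a minimum dwell time."""
    dwell = 0
    for yaw, pitch in gaze_stream:
        dwell = dwell + 1 if classify_aoi(yaw, pitch) == "robot_face" else 0
        if dwell >= min_dwell_frames:
            return True
    return False

# A user finishes reading the tablet and then looks up at the robot's face.
stream = [(-20.0, -25.0)] * 30 + [(0.0, 5.0)] * 20
print(detect_completion(stream))
```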
https://arxiv.org/abs/2603.15951
Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10 ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest object detection. Our approach eliminates arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate computations, achieving an angular error of 8.32° with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40-50% and energy consumption by 65%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1% mAP on COCO (51.8% on attended objects) while maintaining sub-10 ms latency, meeting stringent AR/VR requirements by improving communication time by 177×. Compared to the global YOLOv12n baseline, which achieves 39.2%, 63.4%, and 83.1% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3%, 72.1%, and 88.1% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.
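A hedged sketch of the foveated second stage: map the gaze estimate to a pixel location and crop a fixed-size attended region so the detector only processes that area; the gaze-to-pixel mapping, field of view, and crop size are assumptions rather than the paper's configuration.

```python
import numpy as np

def gaze_to_pixel(yaw_deg, pitch_deg, width, height, fov_deg=(90.0, 70.0)):
    """Project a gaze direction onto image coordinates assuming a simple linear angle-to-pixel mapping."""
    u = (yaw_deg / fov_deg[0] + 0.5) * width
    v = (0.5 - pitch_deg / fov_deg[1]) * height
    return int(np.clip(u, 0, width - 1)), int(np.clip(v, 0, height - 1))

def attended_roi(frame, yaw_deg, pitch_deg, roi=256):
    """Crop a fixed-size region around the gaze point so the detector only processes the attended area."""
    h, w = frame.shape[:2]
    cx, cy = gaze_to_pixel(yaw_deg, pitch_deg, w, h)
    x0, y0 = max(cx - roi // 2, 0), max(cy - roi // 2, 0)
    return frame[y0:y0 + roi, x0:x0 + roi]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
print(attended_roi(frame, yaw_deg=10.0, pitch_deg=-5.0).shape)
```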
https://arxiv.org/abs/2603.15717
Pre-trained gaze models learn to identify useful patterns commonly found across users, but subtle user-specific variations (e.g., eyelid shape or facial structure) can degrade model performance. Test-time personalization (TTP) adapts pre-trained models to these user-specific domain shifts using only a few unlabeled samples. Efficient fine-tuning is critical in performing this domain adaptation: data and computation resources can be limited, especially for on-device customization. While popular parameter-efficient fine-tuning (PEFT) methods address adaptation costs by updating only a small set of weights, they may not take full advantage of structures encoded in pre-trained filters. To more effectively leverage existing structures learned during pre-training, we reframe personalization as a process to reweight existing features rather than learning entirely new ones. We present Attentive Low-Rank Filter Adaptation (Alfa) to adapt gaze models by reweighting semantic patterns in pre-trained filters. With Alfa, singular value decomposition (SVD) extracts dominant spatial components that capture eye and facial characteristics across users. Via an attention mechanism, we need only a few unlabeled samples to adjust and reweight pre-trained structures, selectively amplifying those relevant to a target user. Alfa achieves the lowest average gaze errors across four cross-dataset gaze benchmarks, outperforming existing TTP methods and low-rank adaptation (LoRA)-based variants. We also show that Alfa's attentive low-rank methods can be applied to applications beyond vision, such as diffusion-based language models.
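A minimal sketch of the core reweighting idea as described: decompose a frozen convolutional filter bank with SVD and learn only per-component scales that amplify or attenuate existing structure. The attention mechanism that produces those scales is reduced to a plain learnable vector here, and biases are ignored, so this is illustrative rather than Alfa's implementation.

```python
import torch
import torch.nn as nn

class SVDReweightedConv(nn.Module):
    """Reweight the top-k singular components of a frozen conv weight instead of learning new filters."""
    def __init__(self, frozen_conv: nn.Conv2d, k=8):
        super().__init__()
        w = frozen_conv.weight.detach()                       # (out, in, kh, kw)
        u, s, vh = torch.linalg.svd(w.flatten(1), full_matrices=False)
        self.register_buffer("u", u[:, :k])
        self.register_buffer("s", s[:k])
        self.register_buffer("vh", vh[:k])
        self.shape, self.stride, self.padding = w.shape, frozen_conv.stride, frozen_conv.padding
        self.scale = nn.Parameter(torch.ones(k))              # few-shot-adaptable per-component weights

    def forward(self, x):
        w = (self.u * (self.scale * self.s)) @ self.vh        # rebuild reweighted filters
        return nn.functional.conv2d(x, w.view(self.shape), stride=self.stride, padding=self.padding)

layer = SVDReweightedConv(nn.Conv2d(3, 16, 3, padding=1))
print(layer(torch.randn(1, 3, 32, 32)).shape)
```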
https://arxiv.org/abs/2603.08445
Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity - particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze - the first large-scale off-axis gaze estimation dataset for VR - comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze-appearance disentanglement in a compact, real-time model. GazeShift embeddings can be optionally adapted to individual users via lightweight few-shot calibration, achieving a 1.84-degree mean error on VRGaze. On the remote-camera MPIIGaze dataset, the model achieves a 7.15-degree person-agnostic error, doing so with 10x fewer parameters and 35x fewer FLOPs than baseline methods. Deployed natively on a VR headset GPU, inference takes only 5 ms. Combined with demonstrated robustness to illumination changes, these results highlight GazeShift as a label-efficient, real-time solution for VR gaze tracking. Project code and the VRGaze dataset are released at this https URL.
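A hedged sketch of the optional few-shot calibration step: fit a small linear correction from frozen gaze embeddings to gaze angles using a handful of labeled samples; the embedding size, least-squares solver, and nine-point calibration are assumptions.

```python
import numpy as np

def fit_calibration(embeddings, gaze_deg):
    """Least-squares linear map from frozen gaze embeddings to (pitch, yaw), fit on a few calibration samples."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])   # add bias column
    W, *_ = np.linalg.lstsq(X, gaze_deg, rcond=None)
    return W

def apply_calibration(W, embeddings):
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return X @ W

# Nine calibration points, with 32-d embeddings standing in for the model's gaze features.
rng = np.random.default_rng(0)
emb, labels = rng.normal(size=(9, 32)), rng.uniform(-20, 20, size=(9, 2))
W = fit_calibration(emb, labels)
print(apply_calibration(W, rng.normal(size=(4, 32))).shape)
```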
https://arxiv.org/abs/2603.07832
Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues -- including eyes, head poses, gestures, and contextual features -- demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at this https URL
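A minimal sketch of a top-k Mixture-of-Experts layer of the kind GazeMoE places on top of frozen foundation features; the expert count, top-k routing, and dimensions are illustrative, and the class-balancing loss and augmentations are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Route each token to its top-k experts and mix their outputs with the softmaxed gate scores."""
    def __init__(self, dim=256, n_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                                  # x: (tokens, dim)
        scores = self.gate(x)
        topv, topi = scores.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(10, 256)).shape)
```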
https://arxiv.org/abs/2603.06256
This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the neck-mounted camera perspective. Prior work on egocentric gaze estimation, which predicts device wearer's gaze location within the camera's field of view, mainly focuses on head-mounted cameras while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.
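A hedged sketch of the auxiliary out-of-bound extension: alongside a gaze-heatmap head, add a binary classifier for gaze falling outside the frame and combine the two losses; the head shapes, KL-divergence heatmap loss, and weights are assumptions rather than the GLC configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeWithOOBHead(nn.Module):
    """Predict a gaze heatmap over the frame plus a binary flag for gaze falling outside the field of view."""
    def __init__(self, feat_dim=256, heatmap_hw=(16, 16)):
        super().__init__()
        self.heatmap = nn.Linear(feat_dim, heatmap_hw[0] * heatmap_hw[1])
        self.oob = nn.Linear(feat_dim, 1)
        self.hw = heatmap_hw

    def loss(self, feats, gt_heatmap, gt_oob, w_oob=0.5):
        logits = self.heatmap(feats).view(-1, *self.hw)
        loss_hm = F.kl_div(F.log_softmax(logits.flatten(1), dim=1),
                           gt_heatmap.flatten(1), reduction="batchmean")
        loss_oob = F.binary_cross_entropy_with_logits(self.oob(feats).squeeze(1), gt_oob)
        return loss_hm + w_oob * loss_oob

head = GazeWithOOBHead()
gt = torch.rand(4, 16, 16)
gt = gt / gt.flatten(1).sum(1)[:, None, None]               # normalize each target heatmap to sum to 1
print(head.loss(torch.randn(4, 256), gt, torch.randint(0, 2, (4,)).float()))
```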
https://arxiv.org/abs/2602.11669
Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.
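A minimal sketch of the streaming loop implied above: keep a fixed-length window of recent gaze estimates and condition each new prediction on current visual features plus that window, feeding the estimate back autoregressively. The small MLP, window length, and feature dimension are placeholders for the paper's transformer decoder.

```python
from collections import deque
import torch
import torch.nn as nn

class ARGazeSketch(nn.Module):
    """Predict the current 2D gaze point from visual features concatenated with a bounded gaze history."""
    def __init__(self, feat_dim=128, window=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + 2 * window, 128), nn.ReLU(), nn.Linear(128, 2))
        self.history = deque([torch.zeros(2)] * window, maxlen=window)

    @torch.no_grad()
    def step(self, feats):                      # feats: (feat_dim,) for the current frame only
        ctx = torch.cat(list(self.history))     # fixed-length Gaze Context Window
        gaze = self.net(torch.cat([feats, ctx]))
        self.history.append(gaze)               # autoregressive: feed the estimate back in
        return gaze

model = ARGazeSketch()
for _ in range(5):                              # bounded-memory streaming inference
    print(model.step(torch.randn(128)))
```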
https://arxiv.org/abs/2602.05132
Shared control improves Human-Robot Interaction by reducing the user's workload and increasing the robot's autonomy. It allows robots to perform tasks under the user's supervision. Current eye-tracking-driven approaches face several challenges, including accuracy issues in 3D gaze estimation and difficulty interpreting gaze when differentiating between multiple tasks. We present an eye-tracking-driven control framework aimed at enabling individuals with severe physical disabilities to perform daily tasks independently. Our system uses task pictograms as fiducial markers, combined with a feature matching approach that transmits data about the selected object to carry out the necessary task-related measurements with an eye-in-hand configuration. This eye-tracking control does not require knowledge of the user's position relative to the object. The framework correctly interpreted object and task selection in up to 97.9% of measurements. Issues identified during the evaluation were addressed and are shared as lessons learned. The open-source framework can be adapted to new tasks and objects thanks to the integration of state-of-the-art object detection models.
https://arxiv.org/abs/2601.17404
We introduce GazeD, a new 3D gaze estimation method that jointly provides 3D gaze and human pose from a single RGB image. Leveraging the ability of diffusion models to deal with uncertainty, it generates multiple plausible 3D gaze and pose hypotheses based on the 2D context information extracted from the input image. Specifically, we condition the denoising process on the 2D pose, the surroundings of the subject, and the context of the scene. With GazeD we also introduce a novel way of representing the 3D gaze by positioning it as an additional body joint at a fixed distance from the eyes. The rationale is that the gaze is usually closely related to the pose, and thus it can benefit from being jointly denoised during the diffusion process. Evaluations across three benchmark datasets demonstrate that GazeD achieves state-of-the-art performance in 3D gaze estimation, even surpassing methods that rely on temporal information. Project details will be available at this https URL.
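A minimal sketch of the gaze-as-joint representation described above: place an extra 3D joint at a fixed distance from the eye centre along the gaze direction, and recover the unit direction from the denoised joint afterwards; the 0.5 m distance and the helper names are illustrative.

```python
import numpy as np

GAZE_DIST = 0.5  # metres from the eye centre; a fixed, illustrative value

def gaze_to_joint(eye_center, gaze_dir):
    """Encode gaze as an extra 3D 'joint' so pose and gaze can be denoised jointly."""
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    return eye_center + GAZE_DIST * gaze_dir

def joint_to_gaze(eye_center, gaze_joint):
    """Recover the unit gaze direction from the denoised gaze joint."""
    v = gaze_joint - eye_center
    return v / np.linalg.norm(v)

eye = np.array([0.0, 1.6, 0.0])
direction = np.array([0.3, -0.1, 1.0])
joint = gaze_to_joint(eye, direction)
print(joint, joint_to_gaze(eye, joint))
```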
https://arxiv.org/abs/2601.12948
We present a semantics-modulated, multi-scale Transformer for 3D gaze estimation. Our model conditions CLIP global features with learnable prototype banks (illumination, head pose, background, direction), fuses these prototype-enriched global vectors with CLIP patch tokens and high-resolution CNN tokens in a unified attention space, and replaces several FFN blocks with routed/shared Mixture-of-Experts layers to increase conditional capacity. Evaluated on MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze, our model achieves new state-of-the-art angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, demonstrating up to a 64% relative improvement over previously reported results. Ablations attribute the gains to prototype conditioning, cross-scale fusion, MoE, and hyperparameter choices. Our code is publicly available at https://github.com/AIPMLab/Gazeformer.
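A hedged sketch of the prototype-conditioning step: enrich a global image feature with an attention-weighted mixture of learnable prototypes. The bank size, scaled dot-product weighting, and residual form are assumptions, and the CLIP/CNN token fusion and MoE blocks are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeConditioner(nn.Module):
    """Enrich a global image feature with an attention-weighted mix of learnable prototypes (e.g. illumination, head pose)."""
    def __init__(self, dim=512, n_prototypes=16):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim) * 0.02)

    def forward(self, global_feat):                          # (batch, dim)
        attn = F.softmax(global_feat @ self.prototypes.t() / global_feat.shape[-1] ** 0.5, dim=-1)
        return global_feat + attn @ self.prototypes          # residual prototype enrichment

print(PrototypeConditioner()(torch.randn(4, 512)).shape)
```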
https://arxiv.org/abs/2601.12316
We introduce EyeTheia, a lightweight and open deep learning pipeline for webcam-based gaze estimation, designed for browser-based experimental platforms and real-world cognitive and clinical research. EyeTheia enables real-time gaze tracking using only a standard laptop webcam, combining MediaPipe-based landmark extraction with a convolutional neural network inspired by iTracker and optional user-specific fine-tuning. We investigate two complementary strategies: adapting a model pretrained on mobile data and training the same architecture from scratch on a desktop-oriented dataset. Validation results on MPIIFaceGaze show comparable performance between both approaches prior to calibration, while lightweight user-specific fine-tuning consistently reduces gaze prediction error. We further evaluate EyeTheia in a realistic Dot-Probe task and compare it to the commercial webcam-based tracker SeeSo SDK. Results indicate strong agreement in left-right gaze allocation during stimulus presentation, despite higher temporal variability. Overall, EyeTheia provides a transparent and extensible solution for low-cost gaze tracking, suitable for scalable and reproducible experimental and clinical studies. The code, trained models, and experimental materials are publicly available.
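A minimal sketch of the lightweight user-specific fine-tuning idea: freeze the feature extractor and update only the final regression layer on a few calibration fixations; the landmark-based stand-in backbone, optimizer settings, and nine-point calibration are assumptions, not the EyeTheia code.

```python
import torch
import torch.nn as nn

# Stand-in gaze model: a frozen feature extractor followed by a small regression head.
backbone = nn.Sequential(nn.Linear(68 * 2, 128), nn.ReLU())   # e.g. flattened facial landmarks as input
head = nn.Linear(128, 2)                                       # (x, y) screen-normalized gaze
for p in backbone.parameters():
    p.requires_grad = False                                    # only the head adapts to the user

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
calib_x, calib_y = torch.randn(9, 68 * 2), torch.rand(9, 2)    # nine calibration fixations
for _ in range(50):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(head(backbone(calib_x)), calib_y)
    loss.backward()
    optimizer.step()
print(loss.item())
```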
https://arxiv.org/abs/2601.06279