We present PicoEyes, a unified gaze estimation framework that directly predicts all key attributes of gaze, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, from either monocular or binocular inputs. The framework simultaneously addresses calibration, gaze forecasting, and varying device postures, while also supporting 3D eye reconstruction via joint estimation of eye parameters and depth maps in an end-to-end manner. In addition, we introduce a large-scale multi-view near-eye dataset containing comprehensive 2D and 3D annotations under diverse conditions, including train, test, rewear-test, and calibration sessions. Extensive experiments demonstrate that PicoEyes achieves state-of-the-art performance, consistently outperforming both academic and industrial gaze tracking methods across no-calibration, calibration, rewear-after-calibration, and forecasting settings. This work establishes a practical, end-to-end paradigm for robust and generalizable gaze estimation in mixed reality (MR) applications.
https://arxiv.org/abs/2605.07188
While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, and the motion of the camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset is the only method that maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners and helps reshape future research. The dataset and code are available at this https URL.
https://arxiv.org/abs/2605.04770
Gaze estimation methods commonly use facial appearance to predict the direction of a person's gaze. However, previous studies reveal three major challenges with convolutional neural network (CNN)-based, transformer-based, and contrastive language-image pre-training (CLIP)-based methods: late fusion of image features, lack of factor-aware conditioning, and impractical capacity scaling. To address these challenges, we propose Globally-conditioned Multi-scale Gaze estimation (GMGaze), which leverages a multi-scale transformer architecture. Specifically, the model first introduces semantic prototype conditioning, which modulates the CLIP global image embedding using four learned prototype banks (i.e., illumination, background, head pose, and appearance) to generate two complementary context-biased global tokens. These tokens, along with the CLIP patch and CNN tokens, are fused at the first layer. This early unified fusion prevents the information loss common in late-stage merging. Finally, each token passes through sparse Mixture-of-Experts modules, providing conditional computational capacity without uniformly increasing dense parameters. For cross-domain adaptation, we incorporate an adversarial domain adaptation technique with a feature separation loss that encourages the two global tokens to remain de-correlated. Experiments using four public benchmarks (MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze) show that GMGaze achieves mean angular errors of 2.49$^\circ$, 3.22$^\circ$, 10.16$^\circ$, and 1.44$^\circ$, respectively, outperforming previous baselines in all within-domain settings. In cross-domain evaluations, it provides state-of-the-art (SOTA) results on two standard transfer routes.
https://arxiv.org/abs/2605.00799
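The sparse Mixture-of-Experts step mentioned in the GMGaze abstract can be illustrated with a minimal sketch. The abstract does not state the routing rule, so top-k softmax gating is an assumption here, and the names `top_k_gate` and `moe_forward` are hypothetical. Each token is sent to only `k` experts, so compute grows with `k` rather than with the total number of experts.

```python
import math

def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k largest gate logits.

    Assumption: weights are a softmax restricted to the selected experts.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, gate_logits, k=2):
    """Combine the outputs of the k selected experts for one token x.

    Assumption: every expert maps a list of floats to a list of the same length.
    Experts that are not selected are never called.
    """
    weights = top_k_gate(gate_logits, k)
    out = [0.0] * len(x)
    for i, w in weights.items():
        y = experts[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out
```

In a real model the gate logits would come from a learned linear layer applied to the token; here they are passed in directly to keep the sketch short.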
Driver gaze estimation is essential for understanding the driver's situational awareness of surrounding traffic. Existing gaze estimation models use driver facial information to predict the Point-of-Gaze (PoG) or the 3D gaze direction vector. We propose a benchmark dataset, Urban Driving-Face Scene Gaze (UD-FSG), comprising synchronized driver-face and traffic-scene images. The scene images provide cues about surrounding traffic, which, together with the face images, can help improve the gaze estimation model. We propose SGAP-Gaze, a Scene-Grid Attention based Point-of-Gaze estimation network, trained and tested on our UD-FSG dataset, which explicitly incorporates the scene images into the gaze estimation modelling. The gaze estimation network integrates driver face, eye, iris, and scene contextual information. First, the extracted features from facial modalities are fused to form a gaze intent vector. Then, attention scores are computed over the spatial scene grid using a Transformer-based attention mechanism that fuses face and scene image features to obtain the PoG. The proposed SGAP-Gaze model achieves a mean pixel error of 104.73 on the UD-FSG dataset and 63.48 on the LBW dataset, a 23.5% reduction in mean pixel error compared to state-of-the-art driver gaze estimation models. The spatial pixel distribution analysis shows that SGAP-Gaze consistently achieves lower mean pixel error than existing methods across all spatial ranges, including the outer regions of the scene, which are rare but critical for understanding driver attention. These results highlight the effectiveness of integrating multi-modal gaze cues with scene-aware attention for a robust driver PoG estimation model in real-world driving environments.
https://arxiv.org/abs/2604.19888
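As a rough illustration of how attention scores over a scene grid could be turned into a Point-of-Gaze, and how mean pixel error is computed, consider the sketch below. The SGAP-Gaze abstract does not say how scores become pixel coordinates; the soft-argmax over cell centers used here is an assumption, and `grid_point_of_gaze` and `mean_pixel_error` are hypothetical names.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grid_point_of_gaze(scores, rows, cols, width, height):
    """Expected (x, y) pixel location under softmax-normalized grid scores.

    scores: row-major list of rows * cols attention logits (assumed layout).
    """
    weights = softmax(scores)
    cell_w, cell_h = width / cols, height / rows
    x = y = 0.0
    for idx, w in enumerate(weights):
        r, c = divmod(idx, cols)
        x += w * (c + 0.5) * cell_w  # weight times the cell-center x
        y += w * (r + 0.5) * cell_h  # weight times the cell-center y
    return x, y

def mean_pixel_error(preds, targets):
    """Average Euclidean distance in pixels between predicted and true PoG."""
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(preds, targets)]
    return sum(dists) / len(dists)
```

With uniform scores the estimate falls at the image center; a sharply peaked score map pulls it toward the center of the winning cell.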
Generalizable gaze estimation methods have garnered increasing attention due to their critical importance in real-world applications and have achieved significant progress. However, they often overlook the effect of label noise, arising from the inherent difficulty of acquiring precise gaze annotations, on model generalization performance. In this paper, we are the first to comprehensively investigate the negative effects of label noise on generalization in gaze estimation. Further, we propose a novel solution, called See-Through-Noise (SeeTN) framework, which improves generalization from a novel perspective of mitigating label noise. Specifically, we propose to construct a semantic embedding space via a prototype-based transformation to preserve a consistent topological structure between gaze features and continuous labels. We then measure feature-label affinity consistency to distinguish noisy from clean samples, and introduce a novel affinity regularization in the semantic manifold to transfer gaze-related information from clean to noisy samples. Our proposed SeeTN promotes semantic structure alignment and enforces domain-invariant gaze relationships, thereby enhancing robustness against label noise. Extensive experiments demonstrate that our SeeTN effectively mitigates the adverse impact of source-domain noise, leading to superior cross-domain generalization without compromising the source-domain accuracy, and highlight the importance of explicitly handling noise in generalized gaze estimation.
https://arxiv.org/abs/2604.16562
While appearance-based gaze estimation has achieved significant improvements in accuracy and domain adaptation, the fairness of these systems across different demographic groups remains largely unexplored. To date, there is no comprehensive benchmark quantifying algorithmic bias in gaze estimation. This paper presents the first extensive evaluation of fairness in appearance-based gaze estimation, focusing on ethnicity and gender attributes. We establish a fairness baseline by analyzing state-of-the-art models using standard fairness metrics, revealing significant performance disparities. Furthermore, we evaluate the effectiveness of existing bias mitigation strategies when applied to the gaze domain and show that their fairness contributions are limited. We summarize key insights and open issues. Overall, our work calls for research into developing robust, equitable gaze estimators. To support future research and reproducibility, we publicly release our annotations, code, and trained models at: this http URL
https://arxiv.org/abs/2604.10707
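The fairness abstract does not name the metrics it uses. One simple way to quantify a performance disparity, offered here only as an assumed example, is the gap between the best and worst per-group mean angular error. The function names are hypothetical.

```python
from collections import defaultdict

def group_mean_errors(errors, groups):
    """Mean error per demographic group.

    errors: per-sample angular errors in degrees.
    groups: per-sample group labels, same length as errors.
    """
    buckets = defaultdict(list)
    for e, g in zip(errors, groups):
        buckets[g].append(e)
    return {g: sum(v) / len(v) for g, v in buckets.items()}

def max_disparity(errors, groups):
    """Largest gap between any two groups' mean errors (0 means parity)."""
    means = group_mean_errors(errors, groups)
    return max(means.values()) - min(means.values())
```

A disaggregated report of `group_mean_errors` is usually more informative than the single gap value, since it shows which groups are affected.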
Eye tracking (ET) plays a critical role in augmented and virtual reality applications. However, rapidly deploying high-accuracy, on-device gaze estimation for new products remains challenging because hardware configurations (e.g., camera placement, camera pose, and illumination) often change across device generations. Visual foundation models (VFMs) are a promising direction for rapid training and deployment, and they excel on natural-image benchmarks; yet we find that off-the-shelf VFMs still struggle to achieve high accuracy on specialized near-eye infrared imagery. To address this gap, we introduce DistillGaze, a framework that distills a foundation model by leveraging labeled synthetic data and unlabeled real data for rapid and high-performance on-device gaze estimation. DistillGaze proceeds in two stages. First, we adapt a VFM into a domain-specialized teacher using self-supervised learning on labeled synthetic and unlabeled real images. Synthetic data provides scalable, high-quality gaze supervision, while unlabeled real data helps bridge the synthetic-to-real domain gap. Second, we train an on-device student using both teacher guidance and self-training. Evaluated on a large-scale, crowd-sourced dataset spanning over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while maintaining a lightweight 256K-parameter model suitable for real-time on-device deployment. Overall, DistillGaze provides an efficient pathway for training and deploying ET models that adapt to hardware changes, and offers a recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks.
https://arxiv.org/abs/2604.02509
Appearance-based gaze estimation (AGE) has achieved remarkable performance in constrained settings, yet we reveal a significant generalization gap where existing AGE models often fail in practical, unconstrained scenarios, particularly those involving facial wearables and poor lighting conditions. We attribute this failure to two core factors: limited image diversity and inconsistent label fidelity across different datasets, especially along the pitch axis. To address these, we propose a robust AGE framework that enhances generalization without requiring additional human-annotated data. First, we expand the image manifold via an ensemble of augmentation techniques, including synthesis of eyeglasses, masks, and varied lighting. Second, to mitigate the impact of anisotropic inter-dataset label deviation, we reformulate gaze regression as a multi-task learning problem, incorporating multi-view supervised contrastive (SupCon) learning, discretized label classification, and eye-region segmentation as auxiliary objectives. To rigorously validate our approach, we curate new benchmark datasets designed to evaluate gaze robustness under challenging conditions, a dimension largely overlooked by existing evaluation protocols. Our MobileNet-based lightweight model achieves generalization performance competitive with the state-of-the-art (SOTA) UniGaze-H, while utilizing less than 1\% of its parameters, enabling high-fidelity, real-time gaze tracking on mobile devices.
https://arxiv.org/abs/2603.26945
Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at this https URL.
https://arxiv.org/abs/2603.24724
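The landmark-based abstract mentions a standardized pipeline that normalizes landmarks but gives no details. A minimal sketch of one plausible normalization, assuming 2D landmarks that are centered on their centroid and scaled by the distance between two eye landmarks, might look like this. The function name and the choice of reference points are assumptions.

```python
import math

def normalize_landmarks(points, left_eye_idx, right_eye_idx):
    """Center landmarks on their centroid and divide by inter-ocular distance.

    points: list of (x, y) pixel coordinates.
    left_eye_idx, right_eye_idx: indices of the two reference landmarks
    (hypothetical; the actual indices depend on the landmark detector).
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    lx, ly = points[left_eye_idx]
    rx, ry = points[right_eye_idx]
    scale = math.hypot(rx - lx, ry - ly)
    if scale == 0:
        raise ValueError("reference landmarks coincide; cannot normalize")
    return [((x - cx) / scale, (y - cy) / scale) for x, y in points]
```

The result is invariant to image translation and uniform scaling, which makes it a convenient input for a small regressor such as an MLP or gradient-boosted trees.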
View transformers process multi-view observations to predict actions and have shown impressive performance in robotic manipulation. Existing methods typically extract static visual representations in a view-specific manner, leading to inadequate 3D spatial reasoning ability and a lack of dynamic adaptation. Taking inspiration from how the human brain integrates static and dynamic views to address these challenges, we propose Cortical Policy, a novel dual-stream view transformer for robotic manipulation that jointly reasons from static-view and dynamic-view streams. The static-view stream enhances spatial understanding by aligning features of geometrically consistent keypoints extracted from a pretrained 3D foundation model. The dynamic-view stream achieves adaptive adjustment through position-aware pretraining of an egocentric gaze estimation model, computationally replicating the human cortical dorsal pathway. Subsequently, the complementary view representations of both streams are integrated to determine the final actions, enabling the model to handle spatially-complex and dynamically-changing tasks under language conditions. Empirical evaluations on RLBench, the challenging COLOSSEUM benchmark, and real-world tasks demonstrate that Cortical Policy outperforms state-of-the-art baselines substantially, validating the superiority of dual-stream design for visuomotor control. Our cortex-inspired framework offers a fresh perspective for robotic manipulation and holds potential for broader application in vision-based robot control.
https://arxiv.org/abs/2603.21051
Deep learning-based appearance gaze estimation methods are gaining popularity due to their high accuracy and fewer constraints from the environment. However, existing high-precision models often rely on deeper networks, leading to problems such as large parameter counts, long training times, and slow convergence. To address this issue, this paper proposes a novel lightweight gaze estimation model, FGI-Net (Fusion Global Information). The model fuses global information into the CNN, reducing the need for stacked convolution and pooling layers to capture global information indirectly, while lowering model complexity and improving accuracy and convergence speed. To validate the model, extensive experiments are conducted: comparing accuracy with existing classical and lightweight models, comparing convergence speed with models of different architectures, and running ablation experiments. Experimental results show that, compared with GazeCaps, the latest gaze estimation model, FGI-Net achieves a smaller angular error with 87.1% and 79.1% reductions in parameters and FLOPs, respectively (3.74° on MPIIFaceGaze, 5.15° on EyeDiap, 10.50° on Gaze360, and 6.02° on RT-Gene). Moreover, compared with models of other architectures such as CNNs and Transformers, FGI-Net converges to a higher accuracy range in fewer training iterations: when reaching optimal accuracy on the Gaze360 and EyeDiap datasets, FGI-Net requires 25% and 37.5% fewer training iterations than GazeTR, respectively.
https://arxiv.org/abs/2411.18064
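The angular errors quoted in degrees in these abstracts are conventionally the angle between the predicted and ground-truth 3D gaze vectors. A minimal sketch of that metric follows; the function names are illustrative and not taken from any of the papers.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze vectors (need not be unit length)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cos_sim = dot / (norm_p * norm_t)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, cos_sim))
    return math.degrees(math.acos(cos_sim))

def mean_angular_error_deg(preds, targets):
    """Average angular error over a set of predictions."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Models that output pitch and yaw angles first convert them to a 3D vector; the sign conventions for that conversion differ between datasets, so it is left out of this sketch.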
Online dating has become the dominant way romantic relationships begin, yet current platforms strip the nonverbal cues: gaze, facial expression, body posture, response timing, that humans rely on to signal comfort, disinterest, and consent, creating a communication gap with disproportionate safety consequences for women. We argue that this gap represents both a technical opportunity and a moral responsibility for the computer vision community, which has developed the affective tools, facial action unit detection, gaze estimation, engagement modeling, and multimodal affect recognition, needed to begin addressing it, yet has largely ignored the dating domain as a research context. We propose a fairness-first research agenda organized around four capability areas: real-time discomfort detection, engagement asymmetry modeling between partners, consent-aware interaction design, and longitudinal interaction summarization, each grounded in established CV methodology and motivated by the social psychology of romantic communication. We argue that responsible pursuit of this agenda requires purpose-built datasets collected under dyadic consent protocols, fairness evaluation disaggregated across race, gender identity, neurotype, and cultural background, and architectural commitments to on-device processing that prevent affective data from becoming platform surveillance infrastructure. This vision paper calls on the WICV community, whose members are uniquely positioned to understand both the technical opportunity and the human stakes, to establish online dating safety as a first-class research domain before commercial deployment outpaces ethical deliberation.
https://arxiv.org/abs/2603.26727
We present a new and accurate approach for gaze estimation on consumer computing devices. We take advantage of continued strides in the quality of user-facing cameras found in e.g., smartphones, laptops, and desktops - 4K or greater in high-end devices - such that it is now possible to capture the 2D reflection of a device's screen in the user's eyes. This alone is insufficient for accurate gaze tracking due to the near-infinite variety of screen content. Crucially, however, the device knows what is being displayed on its own screen - in this work, we show this information allows for robust segmentation of the reflection, the location and size of which encodes the user's screen-relative gaze target. We explore several strategies to leverage this useful signal, quantifying performance in a user study. Our best performing model reduces mean tracking error by ~8% compared to a baseline appearance-based model. A supplemental study reveals an additional 10-20% improvement if the gaze-tracking camera is located at the bottom of the device.
https://arxiv.org/abs/2603.19588
We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios. Project page: this https URL.
https://arxiv.org/abs/2603.17161
In human-robot interaction (HRI), detecting a human's gaze helps robots interpret user attention and intent. However, most gaze detection approaches rely on specialized eye-tracking hardware, limiting deployment in everyday settings. Appearance-based gaze estimation methods remove this dependency by using standard RGB cameras, but their practicality in HRI remains underexplored. We present a calibration-free framework for detecting task progression when information is conveyed via integrated display interfaces. The framework uses only the robot's built-in monocular RGB camera (640x480 resolution) and state-of-the-art gaze estimation to monitor attention patterns. It leverages natural behavior, where users shift focus from task interfaces to the robot's face to signal task completion, formalized through three Areas of Interest (AOI): tablet, robot face, and elsewhere. Systematic parameter optimization identifies configurations that balance detection accuracy and interaction latency. We validate our framework in a "First Day at Work" scenario, comparing it to button-based interaction. Results show a task completion detection accuracy of 77.6%. Compared to button-based interaction, the proposed system exhibits slightly higher response latency but preserves information retention and significantly improves comfort, social presence, and perceived naturalness. Notably, most participants reported that they did not consciously use eye movements to guide the interaction, underscoring the intuitive role of gaze as a communicative cue. This work demonstrates the feasibility of intuitive, low-cost, RGB-only gaze-based HRI for natural and engaging interactions.
https://arxiv.org/abs/2603.15951
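The Area-of-Interest logic described in the HRI abstract (tablet, robot face, elsewhere, with a shift from tablet to robot face signalling completion) could be sketched as below. The rectangular AOIs, the consecutive-frame threshold `min_face_frames`, and both function names are assumptions for illustration; the paper's actual parameters are not given in the abstract.

```python
def classify_aoi(point, aois):
    """Map a 2D gaze point to an AOI name, or "elsewhere" if none contains it.

    aois: {name: (x0, y0, x1, y1)} axis-aligned rectangles (assumed shape).
    """
    x, y = point
    for name, (x0, y0, x1, y1) in aois.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return "elsewhere"

def detect_completion(labels, min_face_frames=5):
    """Return the frame index at which task completion is detected, else None.

    Completion fires once the user has looked at the tablet and then keeps
    looking at the robot face for min_face_frames consecutive frames.
    """
    seen_tablet = False
    run = 0
    for i, label in enumerate(labels):
        if label == "tablet":
            seen_tablet = True
            run = 0
        elif label == "robot_face" and seen_tablet:
            run += 1
            if run >= min_face_frames:
                return i
        else:
            run = 0
    return None
```

A larger `min_face_frames` reduces false triggers from brief glances but adds latency, which is the accuracy versus latency trade-off the abstract refers to.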
Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10\,ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest object detection. Our approach eliminates arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate computations, achieving an angular error of $8.32^{\circ}$ with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40-50\% and energy consumption by 65\%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1\% mAP on COCO (51.8\% on attended objects) while maintaining sub-10\,ms latency, meeting stringent AR/VR requirements by improving the communication time by $\times 177$. Compared to the global YOLOv12n baseline, which achieves 39.2\%, 63.4\%, and 83.1\% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3\%, 72.1\%, and 88.1\% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.
https://arxiv.org/abs/2603.15717
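To make the idea of inference by memory lookup concrete, here is a minimal sketch of a classic weightless (RAM-based) network forward pass. It is not the differentiable variant the abstract describes; it only shows the lookup step, assuming a binarized input split into fixed-size bit tuples, each addressing its own table. All names are hypothetical.

```python
def bits_to_address(bits):
    """Pack a sequence of 0/1 values into an integer table address (MSB first)."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | (1 if b else 0)
    return addr

def wnn_lookup(input_bits, tables, tuple_size):
    """Sum one table entry per bit tuple; no multiplications are involved.

    input_bits: binarized input features, length len(tables) * tuple_size.
    tables: one lookup table per tuple, each with 2 ** tuple_size entries.
    """
    if len(input_bits) != len(tables) * tuple_size:
        raise ValueError("input length must equal len(tables) * tuple_size")
    total = 0.0
    for i, table in enumerate(tables):
        chunk = input_bits[i * tuple_size:(i + 1) * tuple_size]
        total += table[bits_to_address(chunk)]
    return total
```

For a two-angle gaze output, one such table bank per output dimension would be a natural arrangement, though the abstract does not confirm the layout.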
Pre-trained gaze models learn to identify useful patterns commonly found across users, but subtle user-specific variations (i.e., eyelid shape or facial structure) can degrade model performance. Test-time personalization (TTP) adapts pre-trained models to these user-specific domain shifts using only a few unlabeled samples. Efficient fine-tuning is critical in performing this domain adaptation: data and computation resources can be limited, especially for on-device customization. While popular parameter-efficient fine-tuning (PEFT) methods address adaptation costs by updating only a small set of weights, they may not be taking full advantage of structures encoded in pre-trained filters. To more effectively leverage existing structures learned during pre-training, we reframe personalization as a process to reweight existing features rather than learning entirely new ones. We present Attentive Low-Rank Filter Adaptation (Alfa) to adapt gaze models by reweighting semantic patterns in pre-trained filters. With Alfa, singular value decomposition (SVD) extracts dominant spatial components that capture eye and facial characteristics across users. Via an attention mechanism, we need only a few unlabeled samples to adjust and reweight pre-trained structures, selectively amplifying those relevant to a target user. Alfa achieves the lowest average gaze errors across four cross-dataset gaze benchmarks, outperforming existing TTP methods and low-rank adaptation (LoRA)-based variants. We also show that Alfa's attentive low-rank methods can be applied to applications beyond vision, such as diffusion-based language models.
https://arxiv.org/abs/2603.08445
Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity - particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze - the first large-scale off-axis gaze estimation dataset for VR - comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze-appearance disentanglement in a compact, real-time model. GazeShift embeddings can be optionally adapted to individual users via lightweight few-shot calibration, achieving a 1.84-degree mean error on VRGaze. On the remote-camera MPIIGaze dataset, the model achieves a 7.15-degree person-agnostic error, doing so with 10x fewer parameters and 35x fewer FLOPs than baseline methods. Deployed natively on a VR headset GPU, inference takes only 5 ms. Combined with demonstrated robustness to illumination changes, these results highlight GazeShift as a label-efficient, real-time solution for VR gaze tracking. Project code and the VRGaze dataset are released at this https URL.
https://arxiv.org/abs/2603.07832
Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues -- including eyes, head poses, gestures, and contextual features -- demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at this https URL
https://arxiv.org/abs/2603.06256
This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the perspective of a neck-mounted camera. Prior work on egocentric gaze estimation, which predicts the device wearer's gaze location within the camera's field of view, mainly focuses on head-mounted cameras, while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.
https://arxiv.org/abs/2602.11669