Gaze Target Detection (GTD), i.e., determining where a person is looking within a scene from an external viewpoint, is a challenging task, particularly in 3D space. Existing approaches heavily rely on analyzing the person's appearance, primarily focusing on their face to predict the gaze target. This paper presents a novel approach to tackle this problem by utilizing the person's upper-body pose and available depth maps to extract a 3D gaze direction and employing a multi-stage or an end-to-end pipeline to predict the gazed target. When predicted accurately, the human body pose can provide valuable information about the head pose, which is a good approximation of the gaze direction, as well as the position of the arms and hands, which are linked to the activity the person is performing and the objects they are likely focusing on. Consequently, in addition to performing gaze estimation in 3D, we are also able to perform GTD simultaneously. We demonstrate state-of-the-art results on the most comprehensive publicly accessible 3D gaze target detection dataset without requiring images of the person's face, thus promoting privacy preservation in various application contexts. The code is available at this https URL.
https://arxiv.org/abs/2409.17886
Current video-based computer vision (CV) applications typically suffer from high energy consumption due to reading and processing all pixels in a frame, regardless of their significance. While previous works have attempted to reduce this energy by skipping input patches or pixels and using feedback from the end task to guide the skipping algorithm, the skipping is not performed during the sensor read phase. As a result, these methods cannot optimize the front-end sensor energy. Moreover, they may not be suitable for real-time applications due to the long latency of modern CV networks deployed in the back-end. To address this challenge, this paper presents a custom-designed reconfigurable CMOS image sensor (CIS) system that improves energy efficiency by selectively skipping uneventful regions or rows within a frame during the sensor's readout phase and the subsequent analog-to-digital conversion (ADC) phase. A novel masking algorithm intelligently directs the skipping process in real time, optimizing both the front-end sensor and back-end neural networks for applications including autonomous driving and augmented/virtual reality (AR/VR). Our system can also operate in standard mode without skipping, depending on application needs. We evaluate our hardware-algorithm co-design framework on object detection based on BDD100K and ImageNetVID, and gaze estimation based on OpenEDS, achieving up to 53% reduction in front-end sensor energy while maintaining state-of-the-art (SOTA) accuracy.
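As a rough illustration of the row-skipping idea (not the paper's in-sensor masking algorithm, which is co-designed with the back-end network), the sketch below flags rows whose content changed enough since the previous frame; unflagged rows could then be skipped during readout and ADC. The frame-difference criterion and threshold are assumptions for illustration.

```python
import numpy as np

def row_skip_mask(prev_frame, curr_frame, threshold=8.0):
    """Toy per-row activity mask: a row is read out only if its mean
    absolute change since the previous frame exceeds a threshold."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    row_activity = diff.mean(axis=1)      # one activity score per sensor row
    return row_activity > threshold       # True = read this row out

# Rows flagged False would be skipped in the readout/ADC phases to save energy.
prev = np.random.randint(0, 200, (480, 640), dtype=np.uint8)
curr = prev.copy()
curr[100:120] += 40                       # simulated motion in a band of rows
mask = row_skip_mask(prev, curr)
print(f"rows read: {mask.sum()} / {mask.size}")
```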
https://arxiv.org/abs/2409.17341
The advent and growing popularity of Virtual Reality (VR) and Mixed Reality (MR) solutions have revolutionized the way we interact with digital platforms. The cutting-edge gaze-controlled typing methods, now prevalent in high-end models of these devices, e.g., Apple Vision Pro, have not only improved user experience but also mitigated traditional keystroke inference attacks that relied on hand gestures, head movements and acoustic side-channels. However, this advancement has paradoxically given birth to a new, potentially more insidious cyber threat, GAZEploit. In this paper, we unveil GAZEploit, a novel eye-tracking based attack specifically designed to exploit this eye-tracking information by leveraging the common use of virtual appearances in VR applications. This widespread usage significantly enhances the practicality and feasibility of our attack compared to existing methods. GAZEploit takes advantage of this vulnerability to remotely extract gaze estimations and steal sensitive keystroke information across various typing scenarios, including messages, passwords, URLs, emails, and passcodes. Our research, involving 30 participants, achieved over 80% accuracy in keystroke inference. Alarmingly, our study also identified over 15 top-rated apps in the Apple Store as vulnerable to the GAZEploit attack, emphasizing the urgent need for bolstered security measures for this state-of-the-art VR/MR text entry method.
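To make the threat model concrete, here is a heavily simplified sketch of the keystroke-inference step: estimated gaze points on a virtual keyboard plane are snapped to the nearest key, and a dwell of several consecutive samples on the same key is read as a keystroke. The key layout, dwell length, and decoding logic are assumptions; GAZEploit's actual pipeline (gaze extraction from avatar eye renderings, fixation/saccade analysis, etc.) is more sophisticated.

```python
import numpy as np

# Hypothetical layout: key label -> (x, y) centre on the virtual keyboard plane.
KEYS = {c: (0.08 * i, 0.0) for i, c in enumerate("qwertyuiop")}

def infer_keystrokes(gaze_points, dwell=5):
    """Snap each 2D gaze sample to its nearest key; a run of `dwell`
    consecutive samples on the same key is decoded as one keystroke."""
    labels = [min(KEYS, key=lambda k: np.hypot(gx - KEYS[k][0], gy - KEYS[k][1]))
              for gx, gy in gaze_points]
    decoded, run = [], 1
    for cur, nxt in zip(labels, labels[1:] + [None]):
        if cur == nxt:
            run += 1
        else:
            if run >= dwell:
                decoded.append(cur)
            run = 1
    return "".join(decoded)

# usage: 6 noisy samples hovering near 'w', then 6 near 'e' -> "we"
samples = [(0.08 + np.random.normal(0, 0.005), 0.0) for _ in range(6)] + \
          [(0.16 + np.random.normal(0, 0.005), 0.0) for _ in range(6)]
print(infer_keystrokes(samples))
```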
https://arxiv.org/abs/2409.08122
Achieving accurate and reliable gaze predictions in complex and diverse environments remains challenging. Fortunately, it is straightforward to access diverse gaze datasets in real-world applications. We discover that training on these datasets jointly can significantly improve the generalization of gaze estimation, which is overlooked in previous works. However, due to the inherent distribution shift across different datasets, simply mixing multiple datasets decreases performance in the original domain despite gaining better generalization abilities. To address the problem of ``cross-dataset gaze estimation'', we propose a novel Evidential Inter-intra Fusion (EIF) framework for training a cross-dataset model that performs well across all source and unseen domains. Specifically, we build independent single-dataset branches for the various datasets, where the data space is partitioned into overlapping subspaces within each dataset for local regression, and further create a cross-dataset branch to integrate the generalizable features from the single-dataset branches. Furthermore, evidential regressors based on the Normal Inverse-Gamma (NIG) distribution are designed to additionally provide uncertainty estimation apart from predicting gaze. Building upon this foundation, our proposed framework achieves both intra-evidential fusion among multiple local regressors within each dataset and inter-evidential fusion among multiple branches via a Mixture of Normal Inverse-Gamma (MoNIG) distribution. Experiments demonstrate that our method consistently achieves notable improvements in both source domains and unseen domains.
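For readers unfamiliar with evidential regression, the sketch below shows the standard Normal Inverse-Gamma moments (prediction, aleatoric and epistemic uncertainty) and one commonly cited form of the NIG summation operator used for mixture-of-NIG style fusion. It only illustrates the kind of fusion the abstract refers to; the EIF paper's exact intra/inter fusion rules and loss terms are not reproduced here and should be checked against the paper.

```python
def nig_moments(gamma, nu, alpha, beta):
    """Prediction and uncertainties of NIG(gamma, nu, alpha, beta),
    as used in deep evidential regression (requires alpha > 1)."""
    prediction = gamma
    aleatoric = beta / (alpha - 1.0)           # E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))    # Var[mu]
    return prediction, aleatoric, epistemic

def nig_sum(p, q):
    """One common form of the NIG summation operator for fusing two
    evidential predictions of the same target (an assumption here,
    not taken from the EIF paper)."""
    g1, n1, a1, b1 = p
    g2, n2, a2, b2 = q
    n = n1 + n2
    g = (n1 * g1 + n2 * g2) / n
    a = a1 + a2 + 0.5
    b = b1 + b2 + 0.5 * (n1 * (g1 - g) ** 2 + n2 * (g2 - g) ** 2)
    return g, n, a, b

# usage: fuse two branch outputs, then read off prediction and uncertainty
fused = nig_sum((0.10, 2.0, 3.0, 0.5), (0.14, 1.0, 2.5, 0.4))
print(nig_moments(*fused))
```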
https://arxiv.org/abs/2409.04766
Multiple datasets have been created for training and testing appearance-based gaze estimators. Intuitively, more data should lead to better performance. However, combining datasets to train a single estimator rarely improves gaze estimation performance. One reason may be differences in the experimental protocols used to obtain the gaze samples, resulting in differences in the distributions of head poses, gaze angles, illumination, etc. Another reason may be the inconsistency between methods used to define gaze angles (label mismatch). We propose two innovations to improve the performance of gaze estimation by leveraging multiple datasets: a change in the estimator architecture and the introduction of a gaze adaptation module. Most state-of-the-art estimators merge information extracted from images of the two eyes and the entire face either in parallel, or combine information from the eyes first and then with the face. Our proposed Two-stage Transformer-based Gaze-feature Fusion (TTGF) method uses transformers to merge information from each eye and the face separately and then merge across the two eyes. We argue that this improves head pose invariance since changes in head pose affect left and right eye images in different ways. Our proposed Gaze Adaptation Module (GAM) method handles annotation inconsistency by applying a Gaze Adaptation Module for each dataset to correct gaze estimates from a single shared estimator. This enables us to combine information across datasets despite differences in labeling. Our experiments show that these innovations improve gaze estimation performance over the SOTA both individually and collectively (by 10% - 20%). Our code is available at this https URL.
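The gaze adaptation idea can be pictured as a tiny per-dataset correction head on top of one shared estimator; the sketch below uses a learnable affine map on the (yaw, pitch) output, initialised to the identity. This parameterisation is an assumption made for illustration, not necessarily the GAM described in the paper.

```python
import torch
import torch.nn as nn

class GazeAdaptationModule(nn.Module):
    """Per-dataset affine correction of a shared estimator's (yaw, pitch)
    output, meant to absorb annotation/label-convention differences."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        nn.init.eye_(self.linear.weight)   # start as an identity correction
        nn.init.zeros_(self.linear.bias)

    def forward(self, gaze):               # gaze: (B, 2) in radians
        return self.linear(gaze)

# One GAM per training dataset, all sharing a single backbone estimator;
# at training time, dataset d's labels supervise gams[d](shared_estimator(x)).
gams = nn.ModuleDict({name: GazeAdaptationModule()
                      for name in ["datasetA", "datasetB"]})
```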
https://arxiv.org/abs/2409.00912
Parsing of eye components (i.e., pupil, iris and sclera) is fundamental for eye tracking and gaze estimation in AR/VR products. Mainstream approaches tackle this problem as a multi-class segmentation task, providing only the visible part of the pupil/iris, while other methods regress elliptical parameters using human-annotated full pupil/iris parameters. In this paper, we consider two priors: the projected full pupil/iris circle can be modelled with ellipses (ellipse prior), and the visibility of the pupil/iris is controlled by the openness of the eye region (condition prior). We design a novel method, CondSeg, to estimate elliptical parameters of the pupil/iris directly from segmentation labels, without explicitly annotating full ellipses, and use an eye-region mask to control the visibility of the estimated pupil/iris ellipses. A conditioned segmentation loss is used to optimize the parameters by transforming parameterized ellipses into pixel-wise soft masks in a differentiable way. Our method is tested on public datasets (OpenEDS-2019/-2020), shows competitive results on segmentation metrics, and simultaneously provides accurate elliptical parameters for further eye-tracking applications.
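The "parameterised ellipse to pixel-wise soft mask" step can be written as a sigmoid over the implicit ellipse equation; the sketch below is one such differentiable rendering, with the sharpness factor tau and the exact parameterisation being assumptions rather than the paper's formulation.

```python
import torch

def soft_ellipse_mask(cx, cy, a, b, theta, H, W, tau=20.0):
    """Differentiable soft mask of an ellipse with centre (cx, cy),
    semi-axes (a, b) and rotation theta, via sigmoid(tau * (1 - d)),
    where d is the implicit ellipse value (<1 inside, >1 outside)."""
    cx, cy, a, b, theta = map(torch.as_tensor, (cx, cy, a, b, theta))
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    x, y = xs - cx, ys - cy
    xr = x * torch.cos(theta) + y * torch.sin(theta)
    yr = -x * torch.sin(theta) + y * torch.cos(theta)
    d = (xr / a) ** 2 + (yr / b) ** 2
    return torch.sigmoid(tau * (1.0 - d))   # values in (0, 1), usable in a BCE/Dice loss

mask = soft_ellipse_mask(60.0, 40.0, 25.0, 15.0, 0.3, H=80, W=120)
print(mask.shape, float(mask[40, 60]))      # centre pixel is close to 1
```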
https://arxiv.org/abs/2408.17231
The availability of extensive datasets containing gaze information for each subject has significantly enhanced gaze estimation accuracy. However, the discrepancy between domains severely affects the performance of a model explicitly trained for a particular domain. In this paper, we propose the Causal Representation-Based Domain Generalization on Gaze Estimation (CauGE) framework, designed based on the general principle of causal mechanisms, which is consistent with the domain difference. We employ an adversarial training manner and an additional penalizing term to extract domain-invariant features. After extracting features, we position an attention layer to make the features sufficient for inferring the actual gaze. By leveraging these modules, CauGE ensures that the neural networks learn from representations that meet the causal mechanisms' general principles. By this, CauGE generalizes across domains by extracting domain-invariant features, and spurious correlations cannot influence the model. Our method achieves state-of-the-art performance on the domain generalization gaze estimation benchmark.
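Adversarial extraction of domain-invariant features is commonly implemented with a gradient reversal layer in front of a domain classifier; a minimal sketch is below. This is a generic technique, offered only to make the adversarial training idea concrete; CauGE's exact adversarial setup and its additional penalizing term may differ.

```python
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# usage: the backbone minimises the gaze loss while the reversed gradient from the
# domain classifier pushes its features toward domain invariance, e.g.
#   domain_logits = domain_classifier(grad_reverse(backbone_features))
#   loss = gaze_loss + domain_ce(domain_logits, domain_labels)
```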
https://arxiv.org/abs/2408.16964
Gaze estimation is pivotal in human scene comprehension tasks, particularly in medical diagnostic analysis. Eye-tracking technology facilitates the recording of physicians' ocular movements during image interpretation, thereby elucidating their visual attention patterns and information-processing strategies. In this paper, we initially define the context-aware gaze estimation problem in medical radiology report settings. To understand the attention allocation and cognitive behavior of radiologists during the medical image interpretation process, we propose a context-aware Gaze EstiMation (GEM) network that utilizes eye gaze data collected from radiologists to simulate their visual search behavior patterns throughout the image interpretation process. It consists of a context-awareness module, visual behavior graph construction, and visual behavior matching. Within the context-awareness module, we achieve intricate multimodal registration by establishing connections between medical reports and images. Subsequently, for a more accurate simulation of genuine visual search behavior patterns, we introduce a visual behavior graph structure, capturing such behavior through high-order relationships (edges) between gaze points (nodes). To maintain the authenticity of visual behavior, we devise a visual behavior-matching approach, adjusting the high-order relationships between them by matching the graph constructed from real and estimated gaze points. Extensive experiments on four publicly available datasets demonstrate the superiority of GEM over existing methods and its strong generalizability, which also provides a new direction for the effective utilization of diverse modalities in medical image interpretation and enhances the interpretability of models in the field of medical imaging. this https URL
https://arxiv.org/abs/2408.05502
Eye gaze contains rich information about human attention and cognitive processes. This capability makes the underlying technology, known as gaze tracking, a critical enabler for many ubiquitous applications and has triggered the development of easy-to-use gaze estimation services. Indeed, by utilizing the ubiquitous cameras on tablets and smartphones, users can readily access many gaze estimation services. In using these services, users must provide their full-face images to the gaze estimator, which is often a black box. This poses significant privacy threats to the users, especially when a malicious service provider gathers a large collection of face images to classify sensitive user attributes. In this work, we present PrivateGaze, the first approach that can effectively preserve users' privacy in black-box gaze tracking services without compromising gaze estimation performance. Specifically, we proposed a novel framework to train a privacy preserver that converts full-face images into obfuscated counterparts, which are effective for gaze estimation while containing no privacy information. Evaluation on four datasets shows that the obfuscated image can protect users' private information, such as identity and gender, against unauthorized attribute classification. Meanwhile, when used directly by the black-box gaze estimator as inputs, the obfuscated images lead to comparable tracking performance to the conventional, unprotected full-face images.
https://arxiv.org/abs/2408.00950
Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the intensive human annotation requirement inhibits current methods from achieving industrial-level accuracy and robustness. Although current unsupervised pre-training frameworks have achieved success in many image recognition tasks, due to the deep coupling between facial and eye features, such frameworks are still deficient in extracting useful gaze features from full-face images. To alleviate the above limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework, which forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations, through collaborative feature contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection bottleneck design that encourages the model to pay more attention to gaze direction rather than facial textures only, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related information contrastive loss is designed to further boost the learned representation by forcing the model to focus on eye-centered regions. Extensive experimental results on several gaze benchmarks demonstrate that the proposed scheme achieves superior performance over the unsupervised state of the art.
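To give a concrete flavour of eye-attended masking, the sketch below builds a ViT-style patch mask that covers exactly the patches overlapping given eye boxes (or, in the "unattended" alternation, everything except them). The box format and grid size are assumptions; the paper's injection-bottleneck auto-encoder is not reproduced here.

```python
import torch

def eye_patch_mask(eye_boxes, img_size=224, patch=16, attend_eyes=True):
    """Boolean patch mask for a ViT-style masked auto-encoder: True marks
    patches to mask out. With attend_eyes=True only eye-covering patches are
    masked; otherwise every patch except the eyes is masked."""
    n = img_size // patch
    mask = torch.zeros(n, n, dtype=torch.bool)
    for x0, y0, x1, y1 in eye_boxes:          # boxes in pixel coordinates (assumed)
        mask[y0 // patch:(y1 + patch - 1) // patch,
             x0 // patch:(x1 + patch - 1) // patch] = True
    return mask if attend_eyes else ~mask

# e.g. two rough eye boxes on a 224x224 face crop
m = eye_patch_mask([(50, 80, 100, 110), (124, 80, 174, 110)])
print(m.sum().item(), "of", m.numel(), "patches masked")
```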
https://arxiv.org/abs/2407.00315
This study presents a novel framework for 3D gaze tracking tailored for mixed-reality settings, aimed at enhancing joint attention and collaborative efforts in team-based scenarios. Conventional gaze tracking, often limited by monocular cameras and traditional eye-tracking apparatus, struggles with simultaneous data synchronization and analysis from multiple participants in group contexts. Our proposed framework leverages state-of-the-art computer vision and machine learning techniques to overcome these obstacles, enabling precise 3D gaze estimation without dependence on specialized hardware or complex data fusion. Utilizing facial recognition and deep learning, the framework achieves real-time tracking of gaze patterns across several individuals, addressing common depth estimation errors and ensuring spatial and identity consistency within the dataset. Empirical results demonstrate the accuracy and reliability of our method in group environments. This provides mechanisms for significant advances in behavior and interaction analysis in educational and professional training applications in dynamic and unstructured environments.
https://arxiv.org/abs/2406.11003
We introduce Nymeria - a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body 3D motion ground truth; b) egocentric multimodal recordings from Project Aria devices with RGB, grayscale, eye-tracking cameras, IMUs, magnetometer, barometer, and microphones; and c) an additional "observer" device providing a third-person viewpoint. We compute world-aligned 6DoF transformations for all sensors, across devices and capture sessions. The dataset also provides 3D scene point clouds and calibrated gaze estimation. We derive a protocol to annotate hierarchical language descriptions of in-context human motion, from fine-grained pose narrations to atomic actions and activity summarization. To the best of our knowledge, the Nymeria dataset is the world's largest in-the-wild collection of human motion with natural and diverse activities; the first of its kind to provide synchronized and localized multi-device multimodal egocentric data; and the world's largest dataset with motion-language descriptions. It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km. The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary size of 6545. To demonstrate the potential of the dataset, we define key research tasks for egocentric body tracking, motion synthesis, and action recognition, and evaluate several state-of-the-art baseline algorithms. Data and code will be open-sourced.
https://arxiv.org/abs/2406.09905
We consider the problem of user-adaptive 3D gaze estimation. The performance of person-independent gaze estimation is limited due to interpersonal anatomical differences. Our goal is to provide a personalized gaze estimation model specifically adapted to a target user. Previous work on user-adaptive gaze estimation requires some labeled images of the target person to fine-tune the model at test time. However, this can be unrealistic in real-world applications, since it is cumbersome for an end-user to provide labeled images. In addition, previous work requires the training data to have both gaze labels and person IDs. This data requirement makes it infeasible to use some of the available data. To tackle these challenges, this paper proposes a new problem called efficient label-free user adaptation in gaze estimation. Our model only needs a few unlabeled images of a target user for the model adaptation. During offline training, we have some labeled source data without person IDs and some unlabeled person-specific data. Our proposed method uses a meta-learning approach to learn how to adapt to a new user with only a few unlabeled images. Our key technical innovation is to use a generalization bound from domain adaptation to define the loss function in meta-learning, so that our method can effectively make use of both the labeled source data and the unlabeled person-specific data during training. Extensive experiments validate the effectiveness of our method on several challenging benchmarks.
https://arxiv.org/abs/2406.09481
Gaze-annotated facial data is crucial for training deep neural networks (DNNs) for gaze estimation. However, obtaining these data is labor-intensive and requires specialized equipment due to the challenge of accurately annotating the gaze direction of a subject. In this work, we present a generative framework to create annotated gaze data by leveraging the benefits of labeled and unlabeled data sources. We propose a Gaze-aware Compositional GAN that learns to generate annotated facial images from a limited labeled dataset. Then we transfer this model to an unlabeled data domain to take advantage of the diversity it provides. Experiments demonstrate our approach's effectiveness in generating within-domain image augmentations in the ETH-XGaze dataset and cross-domain augmentations in the CelebAMask-HQ dataset domain for gaze estimation DNN training. We also show additional applications of our work, which include facial image editing and gaze redirection.
https://arxiv.org/abs/2405.20643
Despite remarkable advancements, mainstream gaze estimation techniques, particularly appearance-based methods, often suffer from performance degradation in uncontrolled environments due to variations in illumination and individual facial attributes. Existing domain adaptation strategies, limited by their need for target domain samples, may fall short in real-world applications. This letter introduces Branch-out Auxiliary Regularization (BAR), an innovative method designed to boost gaze estimation's generalization capabilities without requiring direct access to target domain data. Specifically, BAR integrates two auxiliary consistency regularization branches: one that uses augmented samples to counteract environmental variations, and another that aligns gaze directions with positive source domain samples to encourage the learning of consistent gaze features. These auxiliary pathways strengthen the core network and are integrated in a smooth, plug-and-play manner, facilitating easy adaptation to various other models. Comprehensive experimental evaluations on four cross-dataset tasks demonstrate the superiority of our approach.
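A rough sketch of the two auxiliary consistency terms described above: one ties predictions on augmented views to the original view, and the other pulls predictions toward a positive source-domain sample with (approximately) the same gaze. The function names, the L1 form of the penalties, and the stop-gradients are assumptions for illustration, not the paper's exact losses.

```python
import torch.nn.functional as F

def bar_style_aux_losses(model, x, x_aug, x_pos):
    """Two auxiliary regularizers added to the main gaze loss:
    (1) consistency under appearance/environment augmentation,
    (2) alignment with a positive source-domain sample of similar gaze."""
    g = model(x)              # (B, 2) yaw/pitch predictions
    g_aug = model(x_aug)      # augmented view of the same images
    g_pos = model(x_pos)      # positive source-domain samples
    loss_env = F.l1_loss(g_aug, g.detach())   # robustness to environmental variation
    loss_pos = F.l1_loss(g, g_pos.detach())   # consistent gaze features across samples
    return loss_env + loss_pos
```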
https://arxiv.org/abs/2405.01439
Gaze is an essential prompt for analyzing human behavior and attention. Recently, there has been an increasing interest in determining gaze direction from facial videos. However, video gaze estimation faces significant challenges, such as understanding the dynamic evolution of gaze in video sequences, dealing with static backgrounds, and adapting to variations in illumination. To address these challenges, we propose a simple and novel deep learning model designed to estimate gaze from videos, incorporating a specialized attention module. Our method employs a spatial attention mechanism that tracks spatial dynamics within videos. This technique enables accurate gaze direction prediction through a temporal sequence model, adeptly transforming spatial observations into temporal insights, thereby significantly improving gaze estimation accuracy. Additionally, our approach integrates Gaussian processes to include individual-specific traits, facilitating the personalization of our model with just a few labeled samples. Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings. Specifically, our proposed approach achieves state-of-the-art performance on the Gaze360 dataset, improving by $2.5^\circ$ without personalization. Further, by personalizing the model with just three samples, we achieved an additional improvement of $0.8^\circ$. The code and pre-trained models are available at \url{this https URL}.
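The "spatial attention followed by a temporal sequence model" design can be sketched as below: each frame's feature map is pooled with a learned attention map, and a GRU aggregates the attended features over time before a linear gaze head. The backbone, layer sizes, and use of the last hidden state are placeholders, not the paper's architecture, and the Gaussian-process personalization step is omitted.

```python
import torch
import torch.nn as nn

class VideoGazeNet(nn.Module):
    """Minimal spatial-attention + temporal-GRU sketch for video gaze estimation."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3)  # stand-in CNN
        self.attn = nn.Conv2d(feat_dim, 1, kernel_size=1)    # spatial attention map
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                     # yaw, pitch

    def forward(self, clip):                                 # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        f = self.backbone(clip.flatten(0, 1))                # (B*T, C, h, w)
        w = torch.softmax(self.attn(f).flatten(2), dim=-1)   # attention over spatial positions
        f = (f.flatten(2) * w).sum(-1).view(B, T, -1)        # attended per-frame features
        h, _ = self.gru(f)
        return self.head(h[:, -1])                           # gaze for the last frame

net = VideoGazeNet()
print(net(torch.randn(2, 8, 3, 112, 112)).shape)             # torch.Size([2, 2])
```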
https://arxiv.org/abs/2404.05215
This paper tackles the problem of passive gaze estimation using both event and frame data. Considering inherently different physiological structures, it is intractable to estimate gaze accurately based purely on a given state. Thus, we reformulate gaze estimation as the quantification of state transitions from the current state to several previously registered anchor states. Technically, we propose a two-stage learning-based gaze estimation framework that divides the whole gaze estimation process into a coarse-to-fine process of anchor state selection and final gaze location. Moreover, to improve generalization ability, we align a group of local experts with a student network, where a novel denoising distillation algorithm is introduced that uses denoising diffusion techniques to iteratively remove the inherent noise of event data. Extensive experiments demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods by a large margin of 15%. The code will be publicly available at this https URL.
https://arxiv.org/abs/2404.00548
Driver's eye gaze holds a wealth of cognitive and intentional cues crucial for intelligent vehicles. Despite its significance, research on in-vehicle gaze estimation remains limited due to the scarcity of comprehensive and well-annotated datasets in real driving scenarios. In this paper, we present three novel elements to advance in-vehicle gaze research. Firstly, we introduce IVGaze, a pioneering dataset capturing in-vehicle gaze, collected from 125 subjects and covering a large range of gaze and head poses within vehicles. Conventional gaze collection systems are inadequate for in-vehicle use. In this dataset, we propose a new vision-based solution for in-vehicle gaze collection, introducing a refined gaze target calibration method to tackle annotation challenges. Secondly, our research focuses on in-vehicle gaze estimation leveraging IVGaze. In-vehicle face images often suffer from low resolution, prompting our introduction of a gaze pyramid transformer that leverages transformer-based multilevel feature integration. Expanding upon this, we introduce the dual-stream gaze pyramid transformer (GazeDPTR). Employing perspective transformation, we rotate virtual cameras to normalize images, utilizing camera pose to merge normalized and original images for accurate gaze estimation. GazeDPTR shows state-of-the-art performance on the IVGaze dataset. Thirdly, we explore a novel strategy for gaze zone classification by extending the GazeDPTR. A foundational tri-plane is newly defined, and gaze is projected onto these planes. Leveraging both positional features from the projection points and visual attributes from images, we achieve superior performance compared to relying solely on visual features, substantiating the advantage of gaze estimation. Our project is available at https://yihua.zone/work/ivgaze.
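The "project gaze onto a tri-plane" step amounts to intersecting the 3D gaze ray with three fixed planes and using the intersection points as positional features; a small geometric sketch follows. The plane placement in the example is invented for illustration and does not reflect the paper's tri-plane definition.

```python
import numpy as np

def ray_plane_hits(origin, gaze_dir, planes):
    """Intersect the gaze ray o + t*d (t > 0) with planes given as (normal n,
    offset c) satisfying n·p + c = 0. Returns one 3D hit (or None) per plane."""
    d = gaze_dir / np.linalg.norm(gaze_dir)
    hits = []
    for n, c in planes:
        denom = np.dot(n, d)
        if abs(denom) < 1e-6:
            hits.append(None)                 # ray parallel to the plane
            continue
        t = -(np.dot(n, origin) + c) / denom
        hits.append(origin + t * d if t > 0 else None)
    return hits

# example tri-plane: three axis-aligned planes around the driver (assumed geometry)
planes = [(np.array([0., 0., 1.]), -1.0),     # z = 1
          (np.array([1., 0., 0.]), -0.5),     # x = 0.5
          (np.array([0., 1., 0.]),  0.3)]     # y = -0.3
print(ray_plane_hits(np.zeros(3), np.array([0.2, -0.1, 1.0]), planes))
```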
https://arxiv.org/abs/2403.15664
Gaze estimation methods often experience significant performance degradation when evaluated across different domains, due to the domain gap between the testing and training data. Existing methods try to address this issue using various domain generalization approaches, but with little success because of the limited diversity of gaze datasets, such as in appearance, wearable devices, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to leverage a vision-and-language cross-modality approach for the gaze estimation task. Specifically, we extract gaze-relevant features by pushing them away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context optimization method for text prompt tuning. Furthermore, we utilize the relationship among gaze samples to refine the distribution of gaze-relevant features, thereby improving the generalization capability of the gaze estimation model. Extensive experiments demonstrate the excellent performance of CLIP-Gaze over existing methods on four cross-domain evaluations.
https://arxiv.org/abs/2403.05124
Latest gaze estimation methods require large-scale training data but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users and server-based secure aggregation of the individual estimators' updates. PrivatEyes guarantees that individual gaze data remains private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack DualView that shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to a lower gaze estimation accuracy or substantially higher computational costs - both of which are on par with its non-secure counterparts.
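As background on why secure aggregation keeps individual updates private, the toy sketch below uses pairwise cancelling masks: the aggregator only ever sees masked updates, yet their sum equals the true sum. This is a textbook-style simplification, not PrivatEyes' MPC protocol, which additionally handles dropouts and a malicious majority of servers.

```python
import numpy as np

def secure_aggregate(client_updates, seed=0):
    """Toy pairwise-masking secure aggregation: each client pair (i, j) shares a
    pseudo-random mask that client i adds and client j subtracts, so only the
    sum of the updates is recoverable from the masked values."""
    n = len(client_updates)
    dim = client_updates[0].shape
    masked = [u.astype(np.float64).copy() for u in client_updates]
    for i in range(n):
        for j in range(i + 1, n):
            rng = np.random.default_rng(hash((seed, i, j)) % (2**32))
            m = rng.normal(size=dim)
            masked[i] += m
            masked[j] -= m
    return sum(masked)            # equals sum(client_updates) up to float error

updates = [np.ones(4) * k for k in range(3)]
print(secure_aggregate(updates))  # ~ [3. 3. 3. 3.]
```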
https://arxiv.org/abs/2402.18970