Abstract
Gaze object prediction (GOP) aims to predict the location and category of the object that a person is looking at. Previous GOP methods use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors localize densely packed objects in retail scenarios more accurately. Moreover, the long-distance modeling capability of the Transformer helps to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces the Transformer into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to locate objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism that lets the queries of the gaze autoencoder learn global-memory position knowledge from the object detector. Finally, to make the whole framework trainable end-to-end, we propose a Gaze Box loss that jointly optimizes the object detector and the gaze regressor by enhancing the gaze heatmap energy inside the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be made available.
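To make the idea of "enhancing the gaze heatmap energy in the box of the gaze object" concrete, the following is a minimal sketch of one possible loss with that property. It is not the Gaze Box loss as defined in TransGOP, whose exact formulation the abstract does not give. The function name `gaze_box_loss`, the max-exclusive box convention, and the choice of "one minus the fraction of energy inside the box" are all assumptions made for illustration.

```python
def gaze_box_loss(heatmap, box):
    """Hypothetical loss: 1 minus the fraction of heatmap energy inside the box.

    heatmap: 2D list of non-negative floats (rows of a predicted gaze heatmap).
    box: (x_min, y_min, x_max, y_max) in heatmap pixel coordinates,
         with x_max and y_max exclusive (an assumed convention).
    """
    x_min, y_min, x_max, y_max = box

    # Total energy of the heatmap.
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 1.0  # no energy anywhere, so none falls inside the box

    # Energy restricted to the rows and columns covered by the box.
    inside = sum(sum(row[x_min:x_max]) for row in heatmap[y_min:y_max])

    # The loss shrinks as more of the energy concentrates inside the box.
    return 1.0 - inside / total


# Toy 4x4 heatmap whose energy sits entirely in the central 2x2 block.
heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
loss_aligned = gaze_box_loss(heatmap, (1, 1, 3, 3))  # box covers all the energy
loss_shifted = gaze_box_loss(heatmap, (0, 0, 2, 2))  # box covers a quarter of it
```

In a trainable setting the same quantity would be computed on differentiable tensors, so that its gradient reaches the gaze regressor through the heatmap; a joint objective would also need the box to come from the detector's prediction, which this sketch does not model.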
URL
https://arxiv.org/abs/2402.13578