Paper Reading AI Learner

TransGOP: Transformer-Based Gaze Object Prediction

2024-02-21 07:17:10
Binglu Wang, Chenxi Guo, Yang Jin, Haisheng Xia, Nian Liu

Abstract

Gaze object prediction (GOP) aims to predict the location and category of the object that a person is looking at. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors can predict more accurate object locations for densely packed objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer helps build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces Transformers into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the locations of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism that lets the queries of the gaze autoencoder learn global-memory position knowledge from the object detector. Finally, to make the whole framework trainable end-to-end, we propose a Gaze Box loss that jointly optimizes the object detector and the gaze regressor by enhancing the gaze heatmap energy within the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that our TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at this https URL.
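
The abstract names two components that a concrete illustration may help clarify: the object-to-gaze cross-attention, in which the gaze autoencoder's queries attend to the detector's global memory, and the Gaze Box loss, which raises heatmap energy inside the gaze object's box. Below is a minimal PyTorch sketch of both ideas; the module names, tensor shapes, and the exact loss formulation are assumptions made for illustration and are not taken from the TransGOP code.

import torch
import torch.nn as nn


class ObjectToGazeCrossAttention(nn.Module):
    """Gaze queries attend to the object detector's global memory tokens."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, gaze_queries, detector_memory):
        # gaze_queries:    (B, Nq, dim) queries from the gaze autoencoder
        # detector_memory: (B, Nm, dim) encoder tokens from the Transformer detector
        attended, _ = self.cross_attn(gaze_queries, detector_memory, detector_memory)
        return self.norm(gaze_queries + attended)


def gaze_box_loss(heatmap, gaze_box):
    # heatmap:  (H, W) predicted gaze heatmap with values in [0, 1]
    # gaze_box: (x1, y1, x2, y2) gaze object box in heatmap coordinates
    # One plausible formulation: penalise low heatmap energy inside the box.
    x1, y1, x2, y2 = (int(v) for v in gaze_box)
    inside = heatmap[y1:y2, x1:x2]
    return 1.0 - inside.mean()


if __name__ == "__main__":
    attn = ObjectToGazeCrossAttention()
    queries = torch.randn(2, 100, 256)   # hypothetical gaze queries
    memory = torch.randn(2, 1024, 256)   # hypothetical detector memory tokens
    fused = attn(queries, memory)        # (2, 100, 256)
    heatmap = torch.rand(64, 64)
    loss = gaze_box_loss(heatmap, (10, 10, 30, 30))

In the paper's framework the queries would come from the gaze autoencoder and the memory from the Transformer detector's encoder; here both are generic tensors used only to show the shapes involved.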

URL

https://arxiv.org/abs/2402.13578

PDF

https://arxiv.org/pdf/2402.13578.pdf

