Paper Reading AI Learner

How is Visual Attention Influenced by Text Guidance? Database and Model

2024-04-11 08:03:23
Yinan Sun, Xiongkuo Min, Huiyu Duan, Guangtao Zhai

Abstract

The analysis and prediction of visual attention have long been crucial tasks in computer vision and image processing. In practical applications, images are generally accompanied by various text descriptions; however, few studies have explored the influence of text descriptions on visual attention, let alone developed visual saliency prediction models that take text guidance into account. In this paper, we conduct a comprehensive study of text-guided image saliency (TIS) from both subjective and objective perspectives. Specifically, we construct a TIS database named SJTU-TIS, which includes 1200 text-image pairs and the corresponding eye-tracking data. Based on the SJTU-TIS database, we analyze the influence of various text descriptions on visual attention. Then, to facilitate the development of saliency prediction models that consider text influence, we construct a benchmark on the SJTU-TIS database using state-of-the-art saliency models. Finally, because text descriptions affect visual attention but most existing saliency models ignore this effect, we propose a text-guided saliency (TGSal) prediction model, which extracts and integrates both image features and text features to predict image saliency under various text-description conditions. Our proposed model significantly outperforms the state-of-the-art saliency models on both the SJTU-TIS database and pure image saliency databases across various evaluation metrics. The SJTU-TIS database and the code of the proposed TGSal model will be released at: this https URL.
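The abstract does not name the evaluation metrics it uses. As an illustration only, the sketch below computes the linear correlation coefficient (CC), a metric commonly used to compare a predicted saliency map with a ground-truth fixation density map. The function name and the choice of CC are assumptions made here, not details taken from the paper.

```python
import numpy as np


def correlation_coefficient(pred, gt):
    """Pearson correlation between two saliency maps of equal shape.

    Hypothetical helper for illustration; not taken from the TGSal code.
    Returns a value in [-1, 1], or 0.0 if either map is constant.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("saliency maps must have the same shape")

    # Center both maps so the score ignores a constant brightness offset.
    pred_c = pred - pred.mean()
    gt_c = gt - gt.mean()

    denom = np.sqrt((pred_c ** 2).sum() * (gt_c ** 2).sum())
    if denom == 0.0:
        # A constant map has no variance, so correlation is undefined.
        return 0.0
    return float((pred_c * gt_c).sum() / denom)
```

A score near 1 means the predicted map rises and falls with the ground truth; a score near 0 means the two maps are linearly unrelated.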

URL

https://arxiv.org/abs/2404.07537

PDF

https://arxiv.org/pdf/2404.07537.pdf

