Abstract
The analysis and prediction of visual attention have long been crucial tasks in computer vision and image processing. In practical applications, images are generally accompanied by text descriptions. However, few studies have explored the influence of text descriptions on visual attention, let alone developed visual saliency prediction models that take text guidance into account. In this paper, we conduct a comprehensive study of text-guided image saliency (TIS) from both subjective and objective perspectives. Specifically, we construct a TIS database named SJTU-TIS, which includes 1200 text-image pairs and the corresponding eye-tracking data. Using the SJTU-TIS database, we analyze the influence of various text descriptions on visual attention. Then, to facilitate the development of saliency prediction models that consider text influence, we construct a benchmark on the SJTU-TIS database using state-of-the-art saliency models. Finally, because most existing saliency models ignore the effect of text descriptions on visual attention, we propose a text-guided saliency (TGSal) prediction model, which extracts and integrates both image features and text features to predict image saliency under various text-description conditions. Our proposed model significantly outperforms the state-of-the-art saliency models on both the SJTU-TIS database and pure image saliency databases in terms of various evaluation metrics. The SJTU-TIS database and the code of the proposed TGSal model will be released at: this https URL.
URL
https://arxiv.org/abs/2404.07537