Knowledge-aware Text-Image Retrieval for Remote Sensing Images

2024-05-06 11:27:27
Li Mi, Xianjie Dai, Javiera Castillo-Navarro, Devis Tuia

Abstract

Image-based retrieval in large Earth observation archives is challenging because one must navigate thousands of candidate matches with only the query image as a guide. Using text to support the visual query makes the retrieval system more usable, but it also introduces difficulties, because the diversity of visual signals cannot be summarized by a short caption alone. For this reason, as a matching-based task, cross-modal text-image retrieval often suffers from information asymmetry between texts and images. To address this challenge, we propose a Knowledge-aware Text-Image Retrieval (KTIR) method for remote sensing images. By mining relevant information from an external knowledge graph, KTIR enriches the text scope available in the search query and alleviates the information gap between texts and images for better matching. Moreover, by integrating domain-specific knowledge, KTIR also enhances the adaptation of pre-trained vision-language models to remote sensing applications. Experimental results on three commonly used remote sensing text-image retrieval benchmarks show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.

URL

https://arxiv.org/abs/2405.03373

PDF

https://arxiv.org/pdf/2405.03373.pdf

