Abstract
Image-based retrieval in large Earth observation archives is challenging because one must navigate thousands of candidate matches with only the query image as a guide. Using text to support the visual query makes the retrieval system more usable, but it also introduces difficulties, because the diversity of visual signals cannot be summarized by a short caption alone. As a result, cross-modal text-image retrieval, being a matching-based task, often suffers from information asymmetry between texts and images. To address this challenge, we propose a Knowledge-aware Text-Image Retrieval (KTIR) method for remote sensing images. By mining relevant information from an external knowledge graph, KTIR enriches the text scope available in the search query and alleviates the information gap between texts and images for better matching. Moreover, by integrating domain-specific knowledge, KTIR also enhances the adaptation of pre-trained vision-language models to remote sensing applications. Experimental results on three commonly used remote sensing text-image retrieval benchmarks show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.
URL
https://arxiv.org/abs/2405.03373