Abstract
Remote sensing image-text retrieval is a foundational remote sensing interpretation task, facilitating the alignment of vision and language representations. This paper introduces a prior instruction representation (PIR) learning paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations. Based on PIR, a domain-adapted remote sensing image-text retrieval framework, PIR-ITR, is designed to address semantic noise in vision-language understanding tasks. Moreover, with the massive additional data used to pre-train vision-language foundation models, remote sensing image-text retrieval has further developed into an open-domain retrieval task. Building on this, we propose PIR-CLIP, a domain-specific CLIP-based framework for remote sensing image-text retrieval, to address semantic noise in remote sensing vision-language representations and further improve open-domain retrieval performance. For vision representation, Vision Instruction Representation (VIR), based on Spatial-PAE, utilizes prior knowledge from remote sensing scene recognition to build a belief matrix that selects key features, reducing the impact of semantic noise. For text representation, Language Cycle Attention (LCA), based on Temporal-PAE, uses the previous time step to cyclically activate the current time step, enhancing text representation capability. A cluster-wise Affiliation Loss (AL) is proposed to constrain inter-class relations and reduce semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that PIR can enhance vision and text representations and outperforms state-of-the-art methods in both closed-domain and open-domain retrieval on two benchmark datasets, RSICD and RSITMD.
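The abstract does not give the exact formulation of the cluster-wise Affiliation Loss, but its stated goal (constraining inter-class relations to shrink semantic confusion zones in the common subspace) can be illustrated with a generic cluster-wise loss that pulls embeddings toward their class centroid while pushing different class centroids apart. The following is a minimal NumPy sketch under those assumptions; the function name, the hinge-margin formulation, and the `margin` parameter are hypothetical, not the paper's definition.

```python
import numpy as np

def affiliation_loss(embeddings, labels, margin=0.5):
    """Hypothetical sketch of a cluster-wise loss: intra-class
    compactness plus a margin-based inter-class separation term.

    embeddings: (N, D) array of joint-subspace features.
    labels:     (N,) integer class (cluster) assignments.
    """
    classes = np.unique(labels)
    # Per-class centroids in the common subspace.
    centroids = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in classes]
    )

    # Intra-class term: mean squared distance of each sample
    # to its own class centroid (tighter clusters -> lower loss).
    intra = float(np.mean([
        np.sum((embeddings[labels == c] - centroids[i]) ** 2, axis=1).mean()
        for i, c in enumerate(classes)
    ]))

    # Inter-class term: hinge penalty when two class centroids
    # are closer than `margin` (overlapping clusters -> higher loss).
    inter, pairs = 0.0, 0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            dist = np.linalg.norm(centroids[i] - centroids[j])
            inter += max(0.0, margin - dist)
            pairs += 1
    inter = inter / max(pairs, 1)

    return intra + inter
```

With two well-separated clusters the inter-class hinge is zero and only the small intra-class spread contributes, while overlapping clusters incur the margin penalty; any trained formulation would replace these Euclidean terms with the paper's actual similarity measure.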
URL
https://arxiv.org/abs/2405.10160