Abstract
Remote sensing image-text retrieval is a foundational remote sensing interpretation task, facilitating the alignment of vision and language representations. This paper introduces a prior instruction representation (PIR) learning paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations. Based on PIR, a domain-adapted remote sensing image-text retrieval framework, PIR-ITR, is designed to address semantic noise in vision-language understanding tasks. Moreover, with the massive additional data used to pre-train vision-language foundation models, remote sensing image-text retrieval has further developed into an open-domain retrieval task. Building on this, we propose PIR-CLIP, a domain-specific CLIP-based framework for remote sensing image-text retrieval, to address semantic noise in remote sensing vision-language representations and further improve open-domain retrieval performance. For vision representation, Vision Instruction Representation (VIR), based on Spatial-PAE, utilizes prior knowledge from remote sensing scene recognition to build a belief matrix that selects key features, reducing the impact of semantic noise. For text representation, Language Cycle Attention (LCA), based on Temporal-PAE, uses the previous time step to cyclically activate the current time step, enhancing text representation capability. A cluster-wise Affiliation Loss (AL) is proposed to constrain inter-class relations and reduce semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that PIR can enhance vision and text representations and outperforms state-of-the-art methods in both closed-domain and open-domain retrieval on two benchmark datasets, RSICD and RSITMD.
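The abstract does not give the exact formulation of the cluster-wise Affiliation Loss, but its stated goal (constraining inter-class relations to shrink semantic confusion zones in the common subspace) can be illustrated with a generic cluster-wise loss that pulls embeddings toward their class centroid while pushing different class centroids apart. The following is a minimal NumPy sketch under those assumptions; the function name, the hinge-margin formulation, and the `margin` parameter are hypothetical, not the paper's definition.

```python
import numpy as np

def affiliation_loss(embeddings, labels, margin=0.5):
    """Hypothetical sketch of a cluster-wise loss: intra-class
    compactness plus a margin-based inter-class separation term.

    embeddings: (N, D) array of joint-subspace features.
    labels:     (N,) integer class (cluster) assignments.
    """
    classes = np.unique(labels)
    # Per-class centroids in the common subspace.
    centroids = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in classes]
    )

    # Intra-class term: mean squared distance of each sample
    # to its own class centroid (tighter clusters -> lower loss).
    intra = float(np.mean([
        np.sum((embeddings[labels == c] - centroids[i]) ** 2, axis=1).mean()
        for i, c in enumerate(classes)
    ]))

    # Inter-class term: hinge penalty when two class centroids
    # are closer than `margin` (overlapping clusters -> higher loss).
    inter, pairs = 0.0, 0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            dist = np.linalg.norm(centroids[i] - centroids[j])
            inter += max(0.0, margin - dist)
            pairs += 1
    inter = inter / max(pairs, 1)

    return intra + inter
```

With two well-separated clusters the inter-class hinge is zero and only the small intra-class spread contributes, while overlapping clusters incur the margin penalty; any trained formulation would replace these Euclidean terms with the paper's actual similarity measure.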
URL
https://arxiv.org/abs/2405.10160