
PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning

2024-05-16 14:53:45
Jiancheng Pan, Muyuan Ma, Qing Ma, Cong Bai, Shengyong Chen

Abstract

Remote sensing image-text retrieval constitutes a foundational aspect of remote sensing interpretation tasks, facilitating the alignment of vision and language representations. This paper introduces a prior instruction representation (PIR) learning paradigm that draws on prior knowledge to instruct the adaptive learning of vision and text representations. Based on PIR, a domain-adapted remote sensing image-text retrieval framework, PIR-ITR, is designed to address semantic noise in vision-language understanding tasks. With massive additional data available for pre-training vision-language foundation models, however, remote sensing image-text retrieval has further developed into an open-domain retrieval task. Building on this, we propose PIR-CLIP, a domain-specific CLIP-based framework for remote sensing image-text retrieval, to address semantic noise in remote sensing vision-language representations and further improve open-domain retrieval performance. For vision representation, Vision Instruction Representation (VIR), based on Spatial-PAE, exploits prior knowledge from remote sensing scene recognition by building a belief matrix that selects key features and reduces the impact of semantic noise. For text representation, Language Cycle Attention (LCA), based on Temporal-PAE, uses the previous time step to cyclically activate the current time step and thereby strengthen text representation. A cluster-wise Affiliation Loss (AL) is proposed to constrain inter-class relations and shrink semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that PIR enhances both vision and text representations and outperforms state-of-the-art closed-domain and open-domain retrieval methods on two benchmark datasets, RSICD and RSITMD.
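The abstract describes a cluster-wise Affiliation Loss that constrains inter-class relations in the common subspace. The sketch below is an illustration only, not the paper's formulation: it assumes the loss pulls each embedding toward the center of its own scene cluster and pushes it away from the hardest other center via a margin, and that scene labels serve as cluster assignments. The function name affiliation_loss, the margin value, and the hinge form are hypothetical choices for exposition.

```python
# Minimal sketch of a cluster-wise affiliation-style loss (illustrative only;
# the exact formulation in PIR/PIR-CLIP may differ).
import torch
import torch.nn.functional as F


def affiliation_loss(embeddings: torch.Tensor,
                     cluster_ids: torch.Tensor,
                     num_clusters: int,
                     margin: float = 0.2) -> torch.Tensor:
    """embeddings: (N, D) features in the common subspace; cluster_ids: (N,) scene labels."""
    emb = F.normalize(embeddings, dim=-1)

    # Cluster centers: mean of member embeddings, re-normalised onto the unit sphere.
    centers = torch.zeros(num_clusters, emb.size(1), device=emb.device)
    centers = centers.index_add(0, cluster_ids, emb)
    counts = torch.bincount(cluster_ids, minlength=num_clusters).clamp(min=1)
    centers = F.normalize(centers / counts.unsqueeze(1), dim=-1)

    # Cosine similarity of every sample to every cluster center: (N, K).
    sims = emb @ centers.t()
    pos = sims.gather(1, cluster_ids.unsqueeze(1))  # similarity to own center

    # Mask out the own center, then take the hardest (most similar) other center.
    neg_mask = torch.ones_like(sims).scatter_(1, cluster_ids.unsqueeze(1), 0.0)
    hardest_neg = (sims * neg_mask - (1.0 - neg_mask)).max(dim=1, keepdim=True).values

    # Hinge: own-center similarity should exceed the hardest other center by `margin`,
    # which tightens clusters and separates classes, shrinking confusion zones.
    return F.relu(margin + hardest_neg - pos).mean()


if __name__ == "__main__":
    feats = torch.randn(8, 512)
    labels = torch.randint(0, 4, (8,))
    print(affiliation_loss(feats, labels, num_clusters=4))
```

In practice, a term of this kind would be added to the retrieval objective (e.g., alongside a contrastive image-text loss) with a weighting coefficient; the combination used in the paper is not reproduced here.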

URL

https://arxiv.org/abs/2405.10160

PDF

https://arxiv.org/pdf/2405.10160.pdf

