Abstract
The challenges of urbanization underscore the need for effective satellite image-text retrieval methods that give urban applications swift access to specific information enriched with geographic semantics. However, existing methods often overlook the significant domain gaps across diverse urban landscapes and focus primarily on improving retrieval performance within a single domain. To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval. UrbanCross leverages a high-quality, cross-domain dataset enriched with extensive geo-tags from three countries to highlight domain diversity. It employs a Large Multimodal Model (LMM) for textual refinement and the Segment Anything Model (SAM) for visual augmentation, achieving a fine-grained alignment of images, segments, and texts that yields a 10% improvement in retrieval performance. Additionally, UrbanCross incorporates an adaptive curriculum-based source sampler and a weighted adversarial cross-domain fine-tuning module, which progressively enhance adaptability across domains. Extensive experiments confirm UrbanCross's superior efficiency in retrieval and in adaptation to new urban environments: it demonstrates an average performance increase of 15% over its version without domain adaptation mechanisms, effectively bridging the domain gap.
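For readers unfamiliar with the task, the following is a minimal sketch of what image-text retrieval means in practice: each image and each caption is mapped to an embedding vector, and captions are ranked by cosine similarity to the query image. It is a generic illustration, not the UrbanCross implementation. The toy embeddings, the function names (`cosine`, `rank_texts`, `recall_at_k`), and the use of recall at k as the quality measure are all assumptions made for this example; the abstract does not state which metric underlies its reported improvements.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_texts(image_vec, text_vecs):
    """Return text indices ordered from most to least similar to the image."""
    scores = [cosine(image_vec, t) for t in text_vecs]
    return sorted(range(len(text_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(image_vecs, text_vecs, k):
    """Fraction of images whose paired text (same index) appears in the top k."""
    hits = sum(
        1
        for i, img in enumerate(image_vecs)
        if i in rank_texts(img, text_vecs)[:k]
    )
    return hits / len(image_vecs)


# Hypothetical 3-dimensional embeddings; a real system would obtain these
# from trained image and text encoders.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
texts = [[0.9, 0.1, 0.0], [0.0, 0.3, 0.9], [0.0, 0.4, 0.8]]

r1 = recall_at_k(images, texts, 1)
r2 = recall_at_k(images, texts, 2)
```

In this toy setup, only the first image retrieves its paired caption at rank 1, while all three pairs are recovered within the top 2. A domain gap of the kind the abstract describes would show up as this sort of drop in ranking quality when encoders trained on one region are applied to imagery from another.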
URL
https://arxiv.org/abs/2404.14241