Abstract
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks. Pretraining is an active research topic, encompassing both supervised and self-supervised methods for initializing model weights effectively. However, transferring pretrained models to downstream tasks can suffer from task discrepancy, because pretraining is typically formulated as an image classification or object discrimination task. In this study, we explore the Multi-Task Pretraining (MTP) paradigm for RS foundation models to address this issue. Using a shared-encoder, task-specific-decoder architecture, we conduct multi-task supervised pretraining on the SAMRS dataset, covering semantic segmentation, instance segmentation, and rotated object detection. MTP supports both convolutional neural networks and vision transformer foundation models with over 300 million parameters. The pretrained models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection. Extensive experiments across 14 datasets demonstrate the superiority of our models over existing ones of comparable size, as well as their competitive performance against larger state-of-the-art models, validating the effectiveness of MTP.
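The shared-encoder, task-specific-decoder design described in the abstract can be sketched as follows. This is a minimal NumPy illustration with hypothetical layer shapes, random weights, and made-up task names; it is not the paper's actual backbone or decoders, only the general pattern of one shared feature extractor feeding several per-task heads:

```python
import numpy as np

rng = np.random.default_rng(0)


def shared_encoder(x, w):
    # A single linear projection + ReLU standing in for the shared
    # foundation-model backbone (hypothetical stand-in).
    return np.maximum(x @ w, 0.0)


class MultiTaskModel:
    """Shared encoder with one lightweight decoder (head) per task.

    During multi-task pretraining, all tasks update the shared encoder
    weights, while each head is updated only by its own task's loss.
    """

    def __init__(self, in_dim, hid_dim, task_out_dims):
        # One shared weight matrix for the encoder...
        self.w_enc = rng.standard_normal((in_dim, hid_dim)) * 0.1
        # ...and one independent head per task.
        self.heads = {
            task: rng.standard_normal((hid_dim, out_dim)) * 0.1
            for task, out_dim in task_out_dims.items()
        }

    def forward(self, x):
        # Encode once, then decode per task from the same features.
        z = shared_encoder(x, self.w_enc)
        return {task: z @ w for task, w in self.heads.items()}


# Hypothetical output dimensions for the three pretraining tasks.
model = MultiTaskModel(
    in_dim=16,
    hid_dim=32,
    task_out_dims={"semantic_seg": 8, "instance_seg": 5, "rotated_det": 6},
)
x = rng.standard_normal((4, 16))  # a batch of 4 toy feature vectors
outs = model.forward(x)
```

After pretraining in this fashion, only the shared encoder is kept and transferred; each downstream task attaches its own new decoder before finetuning.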
URL
https://arxiv.org/abs/2403.13430