Abstract
Remote sensing scene classification has been extensively studied for its critical roles in geological survey, oil exploration, traffic management, earthquake prediction, wildfire monitoring, and intelligence monitoring. In the past, machine learning (ML) methods for this task mainly used backbones pretrained with supervised learning (SL). Since Masked Image Modeling (MIM), a self-supervised learning (SSL) technique, has been shown to learn better visual feature representations, it presents a new opportunity to improve ML performance on scene classification. This research explores the potential of MIM-pretrained backbones on four well-known classification datasets: Merced, AID, NWPU-RESISC45, and Optimal-31. Compared to published benchmarks, we show that MIM-pretrained Vision Transformer (ViT) backbones outperform the alternatives (by up to 18% in top-1 accuracy) and that MIM learns better feature representations than its supervised counterparts (by up to 5% in top-1 accuracy). Moreover, we show that general-purpose MIM-pretrained ViTs achieve performance competitive with the specially designed yet complicated Transformer for Remote Sensing (TRS) framework. Our experimental results also provide a performance baseline for future studies.
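The core MIM idea is to hide most of an image's patches and train the model to reconstruct them from the visible remainder. A minimal NumPy sketch of the patch-masking step is below; the patch size, mask ratio, and function name are illustrative assumptions (in the style of MAE-like methods), not details taken from the paper:

```python
import numpy as np

def random_mask_patches(image, patch_size=16, mask_ratio=0.75, rng=None):
    """Split an HxWxC image into non-overlapping patches and randomly mask a
    fraction of them, as done in MIM pretraining; the encoder sees only the
    visible patches, and the model learns to reconstruct the masked ones."""
    rng = rng or np.random.default_rng(0)
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Flatten the image into (num_patches, patch_size*patch_size*c) rows.
    patches = (
        image[: gh * patch_size, : gw * patch_size]
        .reshape(gh, patch_size, gw, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, -1)
    )
    # Randomly choose which patches to mask (hide) vs. keep visible.
    n_mask = int(mask_ratio * len(patches))
    idx = rng.permutation(len(patches))
    masked_idx, visible_idx = idx[:n_mask], idx[n_mask:]
    return patches[visible_idx], masked_idx, visible_idx
```

With a 224x224x3 input and 16x16 patches this yields 196 patches, of which 49 remain visible at a 75% mask ratio; the classification fine-tuning stage then discards the reconstruction head and trains on the full, unmasked patch sequence.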
URL
https://arxiv.org/abs/2302.14256