MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

2024-05-04 23:16:48
Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, Nico Lang

Abstract

The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks, including image classification and semantic segmentation. We find that multi-modal pretraining notably improves the linear probing performance, e.g., 4 pp on BigEarthNet and 16 pp on So2Sat, compared to pretraining on optical satellite images only. We show that this also leads to better label and parameter efficiency, which are crucial aspects in global-scale applications.
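
To make the pretext setup concrete, below is a minimal, hypothetical PyTorch sketch of a multi-pretext masked autoencoder: a shared fully convolutional encoder (a simple stand-in for the paper's ConvNeXt V2 backbone) sees a masked optical input, and one lightweight decoder head per co-located pretext modality reconstructs the hidden pixels. All names (MultiPretextMAE, pretext_channels), channel counts, and the 4x4 patch-masking scheme are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the multi-pretext MAE idea from the abstract:
# one shared encoder, one reconstruction head per pretext modality,
# loss computed only on masked-out pixels. Not the paper's code.
import torch
import torch.nn as nn

class MultiPretextMAE(nn.Module):
    def __init__(self, in_channels=12, embed_dim=64,
                 pretext_channels={"sentinel2": 12, "elevation": 1}):
        super().__init__()
        # Shared fully convolutional encoder (stand-in for ConvNeXt V2).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1),
        )
        # One lightweight reconstruction head per pretext task/modality.
        self.decoders = nn.ModuleDict({
            name: nn.ConvTranspose2d(embed_dim, ch, kernel_size=4, stride=4)
            for name, ch in pretext_channels.items()
        })

    def forward(self, optical, mask):
        # mask: (B, 1, H, W) with 1 = visible pixel, 0 = masked out.
        z = self.encoder(optical * mask)
        return {name: dec(z) for name, dec in self.decoders.items()}

def pretext_loss(preds, targets, mask):
    # MSE reconstruction loss on the hidden (masked) pixels only,
    # summed over all pretext modalities.
    hidden = 1.0 - mask
    loss = 0.0
    for name, pred in preds.items():
        diff = (pred - targets[name]) ** 2 * hidden
        loss = loss + diff.sum() / (hidden.sum() * pred.shape[1] + 1e-8)
    return loss

# Toy usage: a 12-band optical patch paired with co-located targets
# (the pairing by location/time is what makes the targets "free").
B, H, W = 2, 32, 32
optical = torch.randn(B, 12, H, W)
targets = {"sentinel2": optical, "elevation": torch.randn(B, 1, H, W)}
# Random mask at 4x4 patch granularity (~60% masked), upsampled to pixels.
patch_mask = (torch.rand(B, 1, H // 4, W // 4) > 0.6).float()
mask = patch_mask.repeat_interleave(4, dim=2).repeat_interleave(4, dim=3)

model = MultiPretextMAE()
preds = model(optical, mask)
loss = pretext_loss(preds, targets, mask)
loss.backward()
print(f"pretext loss: {loss.item():.4f}")
```

The design choice this sketch illustrates is the one the abstract highlights: because modalities are paired automatically by geographic location and time, each additional decoder head adds a pretext task at no labeling cost.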


URL

https://arxiv.org/abs/2405.02771

PDF

https://arxiv.org/pdf/2405.02771.pdf

