Paper Reading AI Learner

Pretraining Billion-scale Geospatial Foundational Models on Frontier

2024-04-17 19:16:32
Aristeidis Tsaris, Philipe Ambrozio Dias, Abhishek Potnis, Junqi Yin, Feiyi Wang, Dalton Lunga

Abstract

As AI workloads increase in scope, generalization becomes challenging for small task-specific models, and their demand for large amounts of labeled training samples grows. In contrast, Foundation Models (FMs) are trained on internet-scale unlabeled data via self-supervised learning and have been shown to adapt to a variety of tasks with minimal fine-tuning. Although large FMs have demonstrated significant impact in natural language processing and computer vision, efforts toward FMs for geospatial applications have been restricted to smaller models, as pretraining larger models requires very large computing resources equipped with state-of-the-art hardware accelerators. Current satellite constellations collect 100+ TB of data per day, yielding images that are billions of pixels in size and multimodal in nature. Such geospatial data poses unique challenges and opens new opportunities to develop FMs. We investigate billion-scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data, studying end to end how scaling the model size affects performance. Our larger 3B parameter model achieves up to 30% improvement in top-1 scene classification accuracy compared to a 100M parameter model. Moreover, we detail performance experiments on the Frontier supercomputer, America's first exascale system, where we study different model and data parallel approaches using PyTorch's Fully Sharded Data Parallel library. Specifically, we study variants of the Vision Transformer (ViT) architecture, conducting performance analysis for ViT models with up to 15B parameters. By discussing throughput and performance bottlenecks under different parallelism configurations, we offer insights on how to leverage such leadership-class HPC resources when developing large models for geospatial imagery applications.
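To put the 100M / 3B / 15B model sizes above in context, a transformer encoder's parameter count is dominated by its blocks, each contributing roughly 4·d² weights from the attention projections plus 2·r·d² from the MLP (for hidden width d and MLP expansion ratio r). The sketch below uses that back-of-the-envelope formula; the specific layer/width configurations are illustrative assumptions, not the architectures reported in the paper:

```python
def vit_param_estimate(layers: int, hidden: int, mlp_ratio: int = 4) -> int:
    """Rough parameter count for a ViT encoder.

    Each transformer block contributes ~4*d^2 weights from the
    q/k/v/output attention projections and ~2*mlp_ratio*d^2 from the
    two MLP layers; patch/position embeddings and norms are ignored,
    since the blocks dominate at scale.
    """
    per_block = 4 * hidden**2 + 2 * mlp_ratio * hidden**2
    return layers * per_block

# Hypothetical configs matching the scales discussed in the abstract.
configs = {
    "~100M (ViT-Base-like)": (12, 768),
    "~3B (assumed config)":  (48, 2304),
    "~15B (assumed config)": (80, 3968),
}
for name, (depth, width) in configs.items():
    print(f"{name}: ~{vit_param_estimate(depth, width) / 1e9:.2f}B params")
```

At these sizes a single 15B-parameter model is ~60 GB in fp32 before optimizer state, which is why sharded approaches such as FSDP become necessary on systems like Frontier.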

URL

https://arxiv.org/abs/2404.11706

PDF

https://arxiv.org/pdf/2404.11706.pdf
