CRA5: Extreme Compression of ERA5 for Portable Global Climate and Weather Research via an Efficient Variational Transformer

Abstract

Data-driven weather forecasting models, which learn from hundreds of terabytes (TB) of reanalysis data, have significantly advanced forecasting capabilities. However, the substantial cost of storing and transmitting this data is a major challenge for data providers and users alike, and it limits the ability of resource-constrained researchers to participate in AI-based meteorological research. To mitigate this issue, we introduce an efficient neural codec, the Variational Autoencoder Transformer (VAEformer), for extreme compression of climate data. It significantly reduces data storage cost and makes AI-based meteorological research portable. Our approach diverges from recent complex neural codecs by using a low-complexity autoencoder transformer. Its encoder produces a quantized latent representation through variance inference, which reparameterizes the latent space as a Gaussian distribution. This improves the estimation of distributions for cross-entropy coding. Extensive experiments demonstrate that VAEformer outperforms existing state-of-the-art compression methods on climate data. Using VAEformer, we compressed the most popular ERA5 climate dataset (226 TB) into a new dataset, CRA5 (0.7 TB). This corresponds to a compression ratio of over 300 while retaining the dataset's utility for accurate scientific analysis. Further, downstream experiments show that global weather forecasting models trained on the compact CRA5 dataset achieve forecasting accuracy comparable to that of models trained on the original dataset. Code, the CRA5 dataset, and the pre-trained model are available at this https URL.
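
To make the two quantitative ideas in the abstract concrete, here is a minimal Python sketch. It is not the VAEformer implementation: the function names (`compression_ratio`, `symbol_bits`, `estimate_bits`) are hypothetical, and the rate estimate assumes the common learned-compression convention of rounding each latent value to an integer and scoring it under a Gaussian with a predicted mean and scale. It shows how the reported sizes give the stated ratio, and why a sharper Gaussian estimate lowers the coding cost.

```python
import math


def compression_ratio(original_tb: float, compressed_tb: float) -> float:
    """Ratio of original size to compressed size (same units for both)."""
    return original_tb / compressed_tb


def gaussian_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def symbol_bits(y_hat: int, mu: float, sigma: float, floor: float = 1e-9) -> float:
    """Ideal code length in bits for one quantized latent value.

    Assumption: the probability of the integer y_hat is the Gaussian mass
    on the interval [y_hat - 0.5, y_hat + 0.5].
    """
    p = gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)
    return -math.log2(max(p, floor))  # floor avoids log2(0)


def estimate_bits(latent, mus, sigmas) -> float:
    """Quantize each latent value by rounding and sum the ideal code lengths."""
    return sum(
        symbol_bits(round(y), mu, sigma)
        for y, mu, sigma in zip(latent, mus, sigmas)
    )


# Sizes reported in the abstract: ERA5 is 226 TB, CRA5 is 0.7 TB.
ratio = compression_ratio(226.0, 0.7)  # a little under 323, i.e. "over 300"

# A confident (small-sigma) prediction centered on the value is cheaper to code.
loose = symbol_bits(0, mu=0.0, sigma=1.0)
sharp = symbol_bits(0, mu=0.0, sigma=0.1)
```

The design point this illustrates is that the code length depends only on how much probability the estimated distribution places on the value actually observed, so better per-element Gaussian parameters translate directly into fewer bits.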

URL

https://arxiv.org/abs/2405.03376

PDF

https://arxiv.org/pdf/2405.03376.pdf
