Paper Reading AI Learner

HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

2024-04-16 15:02:46
Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, Jun Huan

Abstract

Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of expensive conventional accelerators (such as GPUs), creating a need for alternative, specialized accelerators that are scalable and cost-efficient. AWS Trainium is a second-generation machine learning accelerator purpose-built for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging due to its relatively nascent software ecosystem. In this paper, we showcase HLAT: a 7 billion parameter decoder-only LLM pre-trained using trn1 instances over 1.8 trillion tokens. The performance of HLAT is benchmarked against popular open source baseline models including LLaMA and OpenLLaMA, which have been trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT achieves model quality on par with the baselines. We also share best practices for using the Neuron Distributed Training Library (NDTL), a customized distributed training library for AWS Trainium, to achieve efficient training. Our work demonstrates that AWS Trainium powered by NDTL is able to successfully pre-train state-of-the-art LLMs with high performance and cost-effectiveness.
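
The abstract describes pre-training driven through the Neuron software stack on trn1 instances. As a rough, hedged illustration of what a single pre-training step can look like on a Trainium device via the standard torch-xla path (the paper's actual NDTL setup and model are not reproduced here), consider the following minimal sketch. The toy model, random batch, and hyperparameters are placeholders, not the paper's configuration.

```python
# Minimal sketch (not the paper's code): one training step of a decoder-only LM
# on an XLA device, the path through which Trainium NeuronCores are exposed via
# torch-neuronx / torch-xla. The tiny model and random batch are stand-ins for a
# 7B LLaMA-style model and a real tokenized corpus.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # resolves to a Trainium NeuronCore on a trn1 instance

# Tiny stand-in for a decoder-only transformer (vocab 1000, hidden 64).
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    # Random "token" batch; in practice this comes from a tokenized data stream.
    input_ids = torch.randint(0, 1000, (8, 128), device=device)
    labels = torch.randint(0, 1000, (8, 128), device=device)

    logits = model(input_ids)                               # (batch, seq, vocab)
    loss = loss_fn(logits.view(-1, 1000), labels.view(-1))
    loss.backward()
    xm.optimizer_step(optimizer)  # all-reduce gradients across replicas, then update
    optimizer.zero_grad()
    xm.master_print(f"step {step}: loss {loss.item():.4f}")
```

In practice, NDTL layers sharding strategies (such as tensor and pipeline parallelism) on top of this loop; the sketch above only shows the basic device placement and optimizer step common to XLA-backed training.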

Abstract (translated)

Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computing devices as well as a stable distributed training framework to accelerate training. The growing number of applications leveraging AI/ML has made expensive conventional accelerators (such as GPUs) scarce, creating a need for specialized accelerators that are scalable and efficient. AWS Trainium is a second-generation machine learning accelerator designed specifically for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging because its software ecosystem is relatively nascent. In this paper, we present HLAT: a 7-billion-parameter LLM pre-trained on 1.8 trillion tokens using trn1 instances. HLAT's performance is compared against popular open source baseline models, including LLaMA and OpenLLaMA, which were trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT matches the quality of the baseline models. We also share best practices for achieving efficient training with the Neuron Distributed Training Library (NDTL) on AWS Trainium. Our work demonstrates that AWS Trainium, powered by NDTL, can successfully pre-train state-of-the-art LLMs with high performance and cost-effectiveness.

URL

https://arxiv.org/abs/2404.10630

PDF

https://arxiv.org/pdf/2404.10630.pdf
