Paper Reading AI Learner

HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

2024-04-16 15:02:46
Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, Jun Huan

Abstract

Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of expensive conventional accelerators (such as GPUs), creating a need for alternative, specialized accelerators that are scalable and cost-efficient. AWS Trainium is a second-generation machine learning accelerator purpose-built for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging due to its relatively nascent software ecosystem. In this paper, we showcase HLAT: a 7 billion parameter decoder-only LLM pre-trained using trn1 instances over 1.8 trillion tokens. The performance of HLAT is benchmarked against popular open source baseline models including LLaMA and OpenLLaMA, which have been trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT achieves model quality on par with the baselines. We also share best practices for using the Neuron Distributed Training Library (NDTL), a customized distributed training library for AWS Trainium, to achieve efficient training. Our work demonstrates that AWS Trainium powered by NDTL is able to successfully pre-train state-of-the-art LLMs with high performance and cost-effectiveness.
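
The abstract describes pre-training driven through the Neuron software stack on trn1 instances. As a rough, hedged illustration of what a single pre-training step can look like on a Trainium device via the standard torch-xla path (the paper's actual NDTL setup and model are not reproduced here), consider the following minimal sketch. The toy model, random batch, and hyperparameters are placeholders, not the paper's configuration.

```python
# Minimal sketch (not the paper's code): one training step of a decoder-only LM
# on an XLA device, the path through which Trainium NeuronCores are exposed via
# torch-neuronx / torch-xla. The tiny model and random batch are stand-ins for a
# 7B LLaMA-style model and a real tokenized corpus.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # resolves to a Trainium NeuronCore on a trn1 instance

# Tiny stand-in for a decoder-only transformer (vocab 1000, hidden 64).
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    # Random "token" batch; in practice this comes from a tokenized data stream.
    input_ids = torch.randint(0, 1000, (8, 128), device=device)
    labels = torch.randint(0, 1000, (8, 128), device=device)

    logits = model(input_ids)                               # (batch, seq, vocab)
    loss = loss_fn(logits.view(-1, 1000), labels.view(-1))
    loss.backward()
    xm.optimizer_step(optimizer)  # all-reduce gradients across replicas, then update
    optimizer.zero_grad()
    xm.master_print(f"step {step}: loss {loss.item():.4f}")
```

In practice, NDTL layers sharding strategies (such as tensor and pipeline parallelism) on top of this loop; the sketch above only shows the basic device placement and optimizer step common to XLA-backed training.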

Abstract (translated)

Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computing devices as well as a stable distributed training framework to accelerate training. The growing number of applications leveraging AI/ML has made expensive conventional accelerators (such as GPUs) scarce, creating a need for specialized accelerators that are scalable and efficient. AWS Trainium is a second-generation machine learning accelerator designed specifically for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging because its software ecosystem is relatively nascent. In this paper, we present HLAT: a 7-billion-parameter LLM pre-trained on 1.8 trillion tokens using trn1 instances. HLAT's performance is compared against popular open source baseline models, including LLaMA and OpenLLaMA, which were trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT matches the quality of the baseline models. We also share best practices for achieving efficient training with the Neuron Distributed Training Library (NDTL) on AWS Trainium. Our work demonstrates that AWS Trainium, powered by NDTL, can successfully pre-train state-of-the-art LLMs with high performance and cost-effectiveness.

URL

https://arxiv.org/abs/2404.10630

PDF

https://arxiv.org/pdf/2404.10630.pdf
