Abstract
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.
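To make the notion of "one-shot pruning to 70% sparsity" concrete, here is a minimal, hypothetical sketch using simple magnitude pruning on a weight matrix. This is not the SparseGPT algorithm described in the paper (SparseGPT additionally updates the remaining weights with a Hessian-based layer-wise solver); it only illustrates what zeroing a fixed fraction of weights in a single pass looks like.

```python
import numpy as np

def one_shot_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries until `sparsity` fraction are zero.

    Illustrative magnitude pruning only; the SparseGPT method in the paper
    also corrects the surviving weights to compensate for the pruned ones.
    """
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest absolute value across the whole matrix.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))       # stand-in for one layer's weights
W_sparse = one_shot_prune(W, 0.70)    # target 70% sparsity, as in the paper
print(f"sparsity: {np.mean(W_sparse == 0):.2f}")
```

In the paper's pipeline this one-shot step is followed by sparse pretraining on SlimPajama plus Python code from The Stack, which is what recovers full accuracy at this sparsity level.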
URL
https://arxiv.org/abs/2405.03594