Abstract
The training of Transformer models has revolutionized natural language processing and computer vision, but it remains a resource-intensive and time-consuming process. This paper investigates whether the early-bird ticket hypothesis can be applied to improve the training efficiency of Transformer models. We propose a methodology that combines iterative pruning, mask-distance calculation, and selective retraining to identify early-bird tickets in several Transformer architectures, including ViT, Swin-T, GPT-2, and RoBERTa. Our experimental results demonstrate that early-bird tickets can be found consistently within the first few epochs of training or fine-tuning, enabling significant resource savings without compromising performance. The pruned models obtained from early-bird tickets achieve accuracy comparable to, and sometimes better than, their unpruned counterparts while substantially reducing memory usage. Furthermore, our comparative analysis highlights the generalizability of the early-bird ticket phenomenon across different Transformer models and tasks. This research contributes to the development of efficient training strategies for Transformer models, making them more accessible and resource-friendly. By leveraging early-bird tickets, practitioners can accelerate progress in natural language processing and computer vision while reducing the computational burden of training Transformer models.
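The mask-distance idea in the abstract, comparing the binary pruning masks produced at successive epochs and declaring an early-bird ticket once they stop changing, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the magnitude-pruning rule, the function names, and the `eps`/`window` values are all assumptions.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Binary mask keeping the largest-magnitude (1 - sparsity) fraction of weights
    (a common pruning criterion; assumed here, not taken from the paper)."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)  # number of weights to prune
    threshold = np.sort(flat)[k] if k < len(flat) else np.inf
    return (np.abs(weights) >= threshold).astype(np.uint8)

def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance between two binary pruning masks."""
    return float(np.mean(mask_a != mask_b))

def found_early_bird(mask_history, eps=0.1, window=5):
    """Declare an early-bird ticket once the distance between the latest mask and
    each of the previous `window - 1` masks stays below eps (illustrative rule)."""
    if len(mask_history) < window:
        return False
    recent = mask_history[-window:]
    return max(mask_distance(recent[i], recent[-1]) for i in range(window - 1)) < eps
```

In use, one mask would be computed per epoch from the current weights and appended to `mask_history`; training of the full model can stop, and pruned retraining begin, as soon as `found_early_bird` returns `True`.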
URL
https://arxiv.org/abs/2405.02353