Abstract
Large-scale recommendation models are currently the dominant workload for many large Internet companies. These recommenders are characterized by massive embedding tables that are sparsely accessed through indices for user and item features. The size of these tables, which can exceed 1 TB, imposes a severe memory bottleneck on the training and inference of recommendation models. In this work, we propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM). The proposed framework makes inference more efficient on cloud servers, explores the possibility of deploying powerful recommenders on smaller edge devices, and reduces the communication overhead of distributed training under data-parallel settings. Specifically, we show that quantization-aware training (QAT) imposes a strong regularization effect that mitigates the severe overfitting issues suffered by DLRMs. Consequently, we achieve INT4 quantization of DLRM models without any accuracy drop. We further propose two techniques that improve and accelerate the conventional QAT workload specifically for the embedding tables in recommendation models. Furthermore, to achieve efficient training, we quantize the gradients of the embedding tables into INT8 on top of well-supported gradient sparsification, and we show that combining gradient sparsification and quantization significantly reduces the communication volume. In brief, DQRM models in INT4 achieve 79.07% accuracy on Kaggle with a 0.27 GB model size, and 81.21% accuracy on the Terabyte dataset with 1.57 GB, even outperforming FP32 DLRMs that have much larger model sizes (2.16 GB on Kaggle and 12.58 GB on Terabyte).
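For concreteness, the following is a minimal sketch of what INT4 quantization-aware training of an embedding table can look like in PyTorch: the forward pass sees fake-quantized weights, while a straight-through estimator (STE) passes gradients through to the underlying FP32 copy. The per-table symmetric scale, the FakeQuantINT4 and QATEmbeddingBag names, and the initialization are illustrative assumptions, not the paper's exact implementation; in particular, the paper's two proposed improvements to the QAT workload for embedding tables are not reproduced here.

    import torch
    import torch.nn as nn

    class FakeQuantINT4(torch.autograd.Function):
        """Symmetric uniform fake-quantization to INT4 with a
        straight-through estimator for the backward pass."""

        @staticmethod
        def forward(ctx, w, num_bits=4):
            qmax = 2 ** (num_bits - 1) - 1              # 7 for INT4
            scale = w.abs().max().clamp(min=1e-8) / qmax
            q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
            return q * scale                            # dequantized weights used in forward

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, None                    # STE: pass gradients through unchanged

    class QATEmbeddingBag(nn.Module):
        """EmbeddingBag whose weights are fake-quantized during training,
        so the table can be stored in INT4 at inference time."""

        def __init__(self, num_embeddings, embedding_dim):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(num_embeddings, embedding_dim) * 0.01)

        def forward(self, indices, offsets):
            w_q = FakeQuantINT4.apply(self.weight)
            return nn.functional.embedding_bag(indices, w_q, offsets, mode="sum")

    # Usage: two bags of sparse feature indices, pooled by summation as in DLRM.
    table = QATEmbeddingBag(1000, 16)
    indices = torch.tensor([1, 4, 4, 9])
    offsets = torch.tensor([0, 2])          # bag 0 = rows {1, 4}, bag 1 = rows {4, 9}
    out = table(indices, offsets)           # shape (2, 16)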
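The communication-side idea, INT8 gradient quantization applied on top of gradient sparsification, can likewise be sketched as follows: only the rows of the embedding table actually touched in a step carry gradient (as with sparse embedding gradients in PyTorch), and those rows are quantized row-wise to INT8 before being exchanged between workers. The function names and the per-row symmetric scheme are assumptions for illustration, not the paper's exact protocol.

    import torch

    def quantize_sparse_grad_int8(grad_rows: torch.Tensor):
        """Quantize the non-zero rows of a sparse embedding gradient to INT8.
        grad_rows: (nnz, dim) dense values of the touched rows.
        Returns INT8 values plus per-row FP32 scales, which would be
        communicated in place of the FP32 values."""
        scale = grad_rows.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(grad_rows / scale), -128, 127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.float() * scale

    # Example: a sparse gradient for a (1000, 16) embedding table in which
    # only rows 3, 17, and 42 were accessed, as produced by
    # nn.EmbeddingBag(..., sparse=True).
    grad = torch.sparse_coo_tensor(
        indices=torch.tensor([[3, 17, 42]]),
        values=torch.randn(3, 16),
        size=(1000, 16),
    ).coalesce()

    q, scale = quantize_sparse_grad_int8(grad.values())
    recovered = dequantize(q, scale)

    # Communication payload: row indices, INT8 values, and per-row scales --
    # roughly a 4x reduction over sending FP32 values for the dominant term,
    # compounding with the reduction already obtained from sparsification.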
URL
https://arxiv.org/abs/2410.20046