Paper Reading AI Learner

DQRM: Deep Quantized Recommendation Models

2024-10-26 02:33:52
Yang Zhou, Zhen Dong, Ellick Chan, Dhiraj Kalamkar, Diana Marculescu, Kurt Keutzer

Abstract

Large-scale recommendation models are currently the dominant workload for many large Internet companies. These recommenders are characterized by massive embedding tables that are sparsely accessed by indices for user and item features. The size of these tables, which can exceed 1 TB, imposes a severe memory bottleneck on the training and inference of recommendation models. In this work, we propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM). The proposed framework makes inference more efficient on cloud servers, explores the possibility of deploying powerful recommenders on smaller edge devices, and reduces the communication overhead of distributed training under data-parallel settings. Specifically, we show that quantization-aware training (QAT) imposes a strong regularization effect that mitigates the severe overfitting from which DLRMs suffer. Consequently, we achieve INT4 quantization of DLRM models without any accuracy drop. We further propose two techniques that improve and accelerate the conventional QAT workload, specifically for the embedding tables in recommendation models. Furthermore, to achieve efficient training, we quantize the gradients of the embedding tables into INT8 on top of the well-supported specified sparsification. We show that combining gradient sparsification and quantization significantly reduces the communication volume. In brief, DQRM models with INT4 achieve 79.07% accuracy on Kaggle with a 0.27 GB model size, and 81.21% accuracy on the Terabyte dataset with 1.57 GB, outperforming even FP32 DLRMs of much larger model size (2.16 GB on Kaggle and 12.58 GB on Terabyte).
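The INT4 quantization-aware training described in the abstract can be illustrated with a minimal numpy sketch. This is not the paper's implementation: `fake_quant_int4` is a hypothetical helper showing the standard QAT idea of symmetric per-row fake quantization (quantize, then dequantize, so the forward pass sees INT4 noise while weights stay in floating point; gradients pass straight through). DQRM's exact scheme for embedding tables may differ.

```python
import numpy as np

def fake_quant_int4(w):
    # Symmetric per-row fake quantization to INT4 (levels -8..7).
    # The forward pass uses the dequantized weights, so training
    # experiences quantization noise, which acts as a regularizer.
    scale = np.max(np.abs(w), axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)       # guard all-zero rows
    q = np.clip(np.round(w / scale), -8, 7)        # integer codes
    return q * scale, q.astype(np.int8)            # (dequantized, codes)

rng = np.random.default_rng(0)
table = rng.normal(size=(4, 8)).astype(np.float32)  # toy embedding table
deq, q = fake_quant_int4(table)
print("max abs quantization error:", np.max(np.abs(table - deq)))
```

Per element, the rounding error is at most half a quantization step (scale / 2), which is what bounds the accuracy impact; at inference time only the INT4 codes and the per-row scales need to be stored, giving roughly the 8x size reduction over FP32 reported in the abstract.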
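The communication-reduction idea (gradient sparsification plus INT8 quantization) can likewise be sketched. Embedding-table gradients are naturally row-sparse: only rows looked up in the current batch receive nonzero gradients. The sketch below, with hypothetical helper names and a per-row INT8 scale, shows how a data-parallel worker could ship only the touched rows in INT8 instead of the dense FP32 table; it is an illustration of the general technique, not DQRM's exact protocol.

```python
import numpy as np

def compress_embedding_grad(grad):
    # Keep only rows touched in the batch (row-sparse gradient),
    # then quantize each kept row to INT8 with a per-row scale.
    rows = np.nonzero(np.any(grad != 0, axis=1))[0]
    sub = grad[rows]
    scale = np.max(np.abs(sub), axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(sub / scale), -128, 127).astype(np.int8)
    return rows, q, scale

def decompress_embedding_grad(rows, q, scale, shape):
    # Rebuild the dense gradient on the receiving side.
    out = np.zeros(shape, dtype=np.float32)
    out[rows] = q.astype(np.float32) * scale
    return out

# Toy table: 1000 rows, only 2 touched by the batch.
grad = np.zeros((1000, 16), dtype=np.float32)
grad[[3, 42]] = np.random.default_rng(1).normal(size=(2, 16))
rows, q, scale = compress_embedding_grad(grad)
restored = decompress_embedding_grad(rows, q, scale, grad.shape)

dense_bytes = grad.nbytes                              # FP32, all rows
packed_bytes = q.nbytes + rows.nbytes + scale.nbytes   # INT8 rows + metadata
print("bytes:", dense_bytes, "->", packed_bytes)
```

Sparsification and quantization compose multiplicatively: the row selection removes the untouched rows entirely, and INT8 shrinks each remaining row by 4x versus FP32, which is why combining them cuts communication far more than either alone.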

URL

https://arxiv.org/abs/2410.20046

PDF

https://arxiv.org/pdf/2410.20046.pdf
