Paper Reading AI Learner

'Give Me BF16 or Give Me Death'? Accuracy-Performance Trade-Offs in LLM Quantization

2024-11-04 18:21:59
Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

Abstract

Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Additionally, our study examines the differences between text generated by quantized models and by their uncompressed counterparts. Beyond benchmarks, we also present quantization improvements that allowed us to obtain state-of-the-art accuracy-recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales; (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs a surprisingly low 1-3% accuracy degradation; and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployments on mid-tier GPUs, while W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.
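As context for the formats discussed above, the sketch below shows symmetric round-to-nearest integer quantization, the basic scheme underlying INT8 and INT4 weight quantization. It is a minimal illustration, not the paper's method: the helper names are hypothetical, and it omits the per-group scaling and calibration tuning that the authors' "properly tuned" results depend on.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric round-to-nearest quantization to signed num_bits integers.

    A single shared scale maps the largest-magnitude value onto the edge
    of the integer grid: scale = max|x| / (2**(num_bits-1) - 1),
    i.e. 127 for INT8 and 7 for INT4.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate real values from the integers and shared scale."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25]
q8, s8 = quantize_symmetric(weights, num_bits=8)  # INT8, as in W8A8-INT
q4, s4 = quantize_symmetric(weights, num_bits=4)  # INT4, as in W4A16-INT
```

Round-tripping through `dequantize` bounds the per-value error by half a quantization step (`scale / 2`), which is why INT4's coarser grid costs more accuracy than INT8's and why the paper's calibration improvements matter most at 4 bits.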

URL

https://arxiv.org/abs/2411.02355

PDF

https://arxiv.org/pdf/2411.02355.pdf
