Abstract
Deploying complex neural network models on hardware with limited resources is critical. This paper proposes a novel model quantization method, the Low-Cost Proxy-Based Adaptive Mixed-Precision Model Quantization (LCPAQ), which contains three key modules. A hardware-aware module is designed around the hardware's limitations, while an adaptive mixed-precision quantization module evaluates quantization sensitivity using the Hessian matrix and Pareto-frontier techniques. Integer linear programming is then used to fine-tune the bit-width allocation across layers, and a low-cost proxy neural architecture search module efficiently explores the ideal quantization hyperparameters. Experiments on ImageNet demonstrate that LCPAQ achieves quantization accuracy comparable or superior to existing mixed-precision models, while requiring only 1/200 of the search time of existing methods, offering a practical shortcut to quantization for resource-limited devices.
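The bit-allocation step the abstract describes can be viewed as a small integer program: given a per-layer sensitivity score (e.g., a Hessian-trace estimate) and per-layer parameter counts, choose a bit-width for each layer that minimizes the total sensitivity-weighted quantization error under a model-size budget. The sketch below is illustrative only and not the paper's implementation: the sensitivities, parameter counts, and the `2^(-2b)` error model are assumptions, and a tiny exhaustive search stands in for a real ILP solver.

```python
from itertools import product

def quant_error(sensitivity, bits):
    # Assumed error model: uniform-quantizer noise shrinks roughly
    # 4x per extra bit (proportional to 2^(-2b)), scaled by sensitivity.
    return sensitivity * 2.0 ** (-2 * bits)

def allocate_bits(sensitivities, n_params, budget_bits, choices=(2, 4, 8)):
    """Exhaustively solve the tiny integer program: pick a bit-width per
    layer minimizing total sensitivity-weighted error under a size budget."""
    best, best_cost = None, float("inf")
    for assign in product(choices, repeat=len(sensitivities)):
        size = sum(b * n for b, n in zip(assign, n_params))
        if size > budget_bits:
            continue  # violates the model-size constraint
        cost = sum(quant_error(s, b) for s, b in zip(sensitivities, assign))
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

# Hypothetical 4-layer model: sensitive layers should receive more bits.
sens = [10.0, 1.0, 0.5, 8.0]       # illustrative Hessian-trace proxies
params = [1000, 4000, 4000, 1000]  # illustrative parameter counts
budget = 4 * sum(params)           # average of 4 bits per parameter
bits, cost = allocate_bits(sens, params, budget)
print(bits)  # → (8, 4, 2, 8): high bits for sensitive layers 1 and 4
```

Note how the solution concentrates precision on the small, sensitive layers while the large, insensitive layers absorb the size savings; a real system would replace the exhaustive loop with an ILP solver, as the paper's search space is far larger.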
URL
https://arxiv.org/abs/2402.17706