Abstract
The co-design of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes these three critical dimensions. However, effectively automating the design process across the vast search space of the three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifically, the primary challenges are: (1) memory overhead on the software side: low-precision quantization-aware training can incur significant memory usage, because large intermediate features and latent weights must be stored for back-propagation, potentially causing memory exhaustion; and (2) time-consuming search on the hardware side: the discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, which achieves approximately 7% higher Top-1 accuracy on ImageNet than previous methods and reduces the hardware search time per iteration to 0.15 seconds.
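The abstract does not spell out how CSQ is implemented; the minimal PyTorch sketch below only illustrates the channel-wise sparse idea, assuming a symmetric uniform fake quantizer with a straight-through estimator and a hypothetical per-channel sensitivity score (the paper's actual selection criterion may differ): only the top-k most sensitive output channels are fake-quantized, so quantization state need not be kept for the rest.

```python
import torch

def fake_quantize(x, num_bits=4):
    # Symmetric uniform fake quantization with a per-channel scale.
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().amax(dim=tuple(range(1, x.dim())), keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses q, backward flows through x.
    return x + (q - x).detach()

def csq_forward(weight, sensitivity, quant_ratio=0.3, num_bits=4):
    # Fake-quantize only the most sensitive `quant_ratio` fraction of output
    # channels; the remaining channels stay in full precision, so no extra
    # quantization buffers are stored for them during back-propagation.
    num_quant = max(1, int(quant_ratio * weight.shape[0]))
    top_idx = torch.topk(sensitivity, num_quant).indices
    out = weight.clone()
    out[top_idx] = fake_quantize(weight[top_idx], num_bits)
    return out

# Usage on a conv weight; `sens` stands in for a per-channel sensitivity
# estimate (e.g., a running mean of |grad * weight| -- an assumption here).
w = torch.randn(64, 32, 3, 3, requires_grad=True)
sens = torch.rand(64)
w_q = csq_forward(w, sens, quant_ratio=0.3, num_bits=4)
```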
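BatchTile's hardware generation network is likewise not detailed in the abstract; the sketch below only illustrates the batching idea behind it: encoding every candidate tiling of a loop nest as one row of a batch and scoring all of them in a single forward pass of a learned cost model (a hypothetical MLP stand-in here), instead of invoking a compiler or simulator once per tiling mode.

```python
import itertools
import torch
import torch.nn as nn

# Hypothetical stand-in for a learned latency predictor over encoded tilings.
cost_model = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))

def batch_tile_search(loop_bounds, tile_options):
    # Enumerate all tile-size combinations that evenly divide the loop bounds.
    candidates = [t for t in itertools.product(*tile_options)
                  if all(b % s == 0 for b, s in zip(loop_bounds, t))]
    # Encode each candidate as one row: [loop bounds, tile sizes].
    feats = torch.tensor([list(loop_bounds) + list(t) for t in candidates],
                         dtype=torch.float32)
    # Score every tiling mode in one batched forward pass.
    with torch.no_grad():
        latency = cost_model(feats).squeeze(1)
    return candidates[int(latency.argmin())]

# Example: pick tile sizes for a (H, W, C) = (56, 56, 64) loop nest.
best = batch_tile_search((56, 56, 64), [(7, 8, 14), (7, 8, 14), (8, 16, 32)])
```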
URL
https://arxiv.org/abs/2501.05339