Abstract
Fixed-point (FXP) inference has proven suitable for embedded devices with limited computational resources, yet model training is still routinely performed in floating point (FLP). FXP training has not been fully explored, and the non-trivial conversion from FLP to FXP causes an unavoidable performance drop. We propose a novel method to train and obtain FXP convolutional keyword-spotting (KWS) models. We combine our methodology with two quantization-aware training (QAT) techniques for model parameters - squashed weight distribution and absolute cosine regularization - and propose techniques for extending QAT to transient variables, which previous paradigms neglect. Experimental results on the Google Speech Commands v2 dataset show that we can reduce model precision down to 4 bits with no loss in accuracy. Furthermore, on an in-house KWS dataset, we show that our 8-bit FXP-QAT models achieve a 4-6% relative improvement in false discovery rate at a fixed false reject rate compared to full-precision FLP models. During inference, we argue that FXP-QAT eliminates Q-format normalization and enables the use of low-bit accumulators while maximizing SIMD throughput, reducing user-perceived latency. We demonstrate that we can reduce execution time by 68% without compromising the KWS model's predictive performance or requiring architectural changes. Our work provides novel findings that can aid future research in this area and enable accurate and efficient models.
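The QAT mechanism the abstract builds on can be illustrated with a minimal fake-quantization sketch: during training, tensors are rounded to an n-bit fixed-point grid in the forward pass so the model learns to tolerate quantization. This is a generic simulation (symmetric per-tensor scaling is an assumption here), not the paper's specific squashed-weight-distribution or absolute-cosine-regularization scheme:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulate n-bit fixed-point quantization ("fake quantization").

    Values are snapped to the nearest level of a symmetric n-bit grid.
    In a real QAT setup the rounding would be paired with a
    straight-through estimator so gradients flow through unchanged.
    """
    qmax = 2 ** (num_bits - 1) - 1        # e.g. 127 for 8-bit
    scale = float(np.max(np.abs(x))) / qmax
    if scale == 0.0:                      # all-zero tensor: any scale works
        scale = 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer grid
    return q * scale                      # dequantized, grid-aligned values

# Example: at 4 bits (grid of 1/7 steps here), 0.3 snaps to 2/7.
out = fake_quantize(np.array([1.0, -1.0, 0.3]), num_bits=4)
```

At lower bit widths the grid coarsens quickly, which is why extending QAT beyond weights to transient (intermediate) values, as the abstract proposes, matters for end-to-end FXP execution.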
URL
https://arxiv.org/abs/2303.02284