Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment

Abstract

Recent transformer-based automatic speech recognition (ASR) models have achieved word error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources and contribute to significant carbon footprints. The traditional server-based architecture of ASR also raises privacy concerns, along with reliability and latency issues caused by network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by balancing energy use and accuracy for specific applications. This study examines how quantization, memory demands, and energy consumption affect the inference performance of various ASR models on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speed, quantization, energy efficiency, and memory needs. We found that changing precision from FP32 to FP16 halves the energy consumption for audio transcription across different models, with minimal performance degradation. A larger model size and parameter count neither guarantees better resilience to noise nor predicts the energy consumption for a given transcription load. These findings, along with several others, offer novel insights for optimizing ASR systems within energy- and memory-limited environments, which is crucial for the development of efficient on-device ASR solutions. The code and input data needed to reproduce the results in this article are open source and available at [this https URL].
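
As an illustration of the two quantities the abstract compares, the following is a minimal sketch, not the authors' code. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and assumes energy is estimated by integrating sampled power over time with the trapezoidal rule. The function names `word_error_rate` and `energy_joules` are hypothetical, and how power samples are obtained on the Jetson is not covered here.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Trapezoidal integration of power samples (watts) over time (seconds)."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for k in range(len(timestamps_s) - 1):
        dt = timestamps_s[k + 1] - timestamps_s[k]
        total += dt * (power_w[k] + power_w[k + 1]) / 2.0
    return total
```

Under these assumptions, comparing two precisions for the same audio amounts to computing `word_error_rate` on each transcript and `energy_joules` on each power trace, then weighing the difference in accuracy against the difference in energy.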

URL

https://arxiv.org/abs/2405.01004

PDF

https://arxiv.org/pdf/2405.01004.pdf
