Paper Reading AI Learner

Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment

2024-05-02 05:09:07
Aditya Chakravarty

Abstract

Recent transformer-based ASR models have achieved word-error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources, contributing to significant carbon footprints. The traditional server-based architecture of ASR also presents privacy concerns, alongside reliability and latency issues due to network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by effectively balancing energy use and accuracy for specific applications. This study examines the effects of quantization, memory demands, and energy consumption on the performance of various ASR model inference on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speeds, quantization, energy efficiency, and memory needs. We found that changing precision from fp32 to fp16 halves the energy consumption for audio transcription across different models, with minimal performance degradation. A larger model size and number of parameters neither guarantees better resilience to noise, nor predicts the energy consumption for a given transcription load. These, along with several other findings offer novel insights for optimizing ASR systems within energy- and memory-limited environments, crucial for the development of efficient on-device ASR solutions. The code and input data needed to reproduce the results in this article are open sourced are available on [this https URL].

Abstract (translated)

近年来,基于Transformer的自动语音识别(ASR)模型已经实现了词错误率(WER)低于4%,超过了人类注释者的工作准确率,然而它们需要大量的服务器资源,导致显著的碳排放足迹。传统的基于服务器的ASR架构也存在隐私问题,以及由于网络依赖关系而导致的可靠性和延迟问题。相比之下,在设备级(边缘)ASR上,通过有效平衡能源消耗和准确性,提高了隐私,增强了性能,促进了可持续性。 本研究探讨了量化、内存需求和能源消耗对各种ASR模型在NVIDIA Jetson Orin Nano上的性能的影响。通过分析在干净和噪音数据集上使用FP32、FP16和INT8量化模型的WER和转录速度,我们突出了准确度、速度、量化、能源效率和内存需求之间的关键权衡。我们发现,将精度从fp32变为fp16可以减半不同模型的音频转录能源消耗,同时性能降幅很小。模型大小和参数数量越大,并不能保证对噪声的鲁棒性,也不能预测给定转录负载的能源消耗。这些以及其他发现为在能源和内存受限的环境中优化ASR系统提供了新的见解,对于实现高效的本地ASR解决方案具有关键意义。本文开源的代码和输入数据可在[本文的链接]中找到。

URL

https://arxiv.org/abs/2405.01004

PDF

https://arxiv.org/pdf/2405.01004.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot