Paper Reading AI Learner

'It is okay to be uncommon': Quantizing Sound Event Detection Networks on Hardware Accelerators with Uncommon Sub-Byte Support

2024-04-05 20:08:43
Yushu Wu, Xiao Quan, Mohammad Rasool Izadi, Chuan-Che Huang

Abstract

If our noise-canceling headphones can understand our audio environments, they can inform us of important sound events, tune equalization based on the type of content we listen to, and dynamically adjust noise-cancellation parameters based on the audio scene to further reduce distraction. However, running multiple audio understanding models on headphones with a limited energy budget and on-chip memory remains challenging. In this work, we identify a new class of neural network accelerators (e.g., the NE16 on GAP9) that allows network weights to be quantized at both common (e.g., 8-bit) and uncommon bit-widths (e.g., 3-bit). We then apply a differentiable neural architecture search to find the optimal bit-widths of a network on two different sound event detection tasks with potentially different requirements on quantization and prediction granularity (i.e., classification vs. embeddings for few-shot learning). We further evaluate our quantized models on actual hardware, showing that, compared to 8-bit models, they reduce memory usage, inference latency, and energy consumption by an average of 62%, 46%, and 61%, respectively, while maintaining floating-point performance. Our work sheds light on the benefits of such accelerators for sound event detection tasks when combined with an appropriate search method.
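To make the "uncommon bit-width" idea concrete, the sketch below shows symmetric weight quantization at an arbitrary bit-width such as 3 bits. This is an illustrative assumption, not the paper's actual scheme: the authors' accelerator (NE16 on GAP9) and their differentiable search may use a different quantizer, and the function names here are hypothetical.

```python
import numpy as np

def quantize_symmetric(weights, num_bits):
    """Symmetric quantization of weights to a signed num_bits grid.

    Illustrative sketch only; the paper's hardware-specific scheme
    (e.g., per-channel scales, rounding mode) may differ.
    """
    qmax = 2 ** (num_bits - 1) - 1          # e.g., 3 for 3-bit signed weights
    scale = np.max(np.abs(weights)) / qmax  # map the largest |weight| to qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)

# 3-bit weights take only the 7 values {-3, ..., 3}, so storage drops
# from 8 bits to 3 bits per weight on hardware with sub-byte support.
q3, s3 = quantize_symmetric(w, num_bits=3)
w_hat = dequantize(q3, s3)
```

A differentiable architecture search in this setting would treat `num_bits` per layer as a choice variable (e.g., relaxing the discrete choice with a softmax over candidate bit-widths during training), then commit each layer to the bit-width the accelerator supports.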


URL

https://arxiv.org/abs/2404.04386

PDF

https://arxiv.org/pdf/2404.04386.pdf

