Paper Reading AI Learner

AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognition

2024-04-21 06:33:04
Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po

Abstract

Recent research has successfully adapted vision-based convolutional neural network (CNN) architectures for audio recognition tasks using Mel-Spectrograms. However, these CNNs have high computational costs and memory requirements, limiting their deployment on low-end edge devices. Motivated by the success of efficient vision models like InceptionNeXt and ConvNeXt, we propose AudioRepInceptionNeXt, a single-stream architecture. Its basic building block decomposes the parallel multi-branch depth-wise convolutions with descending scales of k x k kernels into a cascade of two multi-branch depth-wise convolutions. The first multi-branch consists of parallel multi-scale 1 x k depth-wise convolutional layers, followed by a similar multi-branch employing parallel multi-scale k x 1 depth-wise convolutional layers. This reduces the computational and memory footprint while separating the time and frequency processing of Mel-Spectrograms. The large kernels capture global frequencies and long activities, while the small kernels capture local frequencies and short activities. We also reparameterize the multi-branch design during inference to further boost speed without losing accuracy. Experiments show that AudioRepInceptionNeXt reduces parameters and computations by more than 50% and improves inference speed by 1.28x over state-of-the-art CNNs such as Slow-Fast, while maintaining comparable accuracy. It also learns robustly across a variety of audio recognition tasks. Code is available at this https URL.
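The two savings claimed above rest on simple arithmetic: factorizing a k x k depth-wise kernel into a 1 x k plus a k x 1 kernel shrinks per-channel parameters from k^2 to 2k, and inference-time reparameterization works because convolution is linear, so parallel branches can be summed into one kernel. A minimal self-contained sketch, using 1-D convolution and illustrative kernel sizes that are assumptions, not values from the paper:

```python
# Hypothetical sketch: kernel sizes and all names below are illustrative
# assumptions, not the paper's actual configuration.

def conv1d(signal, kernel):
    """'Same'-padded 1-D cross-correlation with an odd-length kernel."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal))]

def pad_kernel(kernel, target_len):
    """Zero-pad an odd-length kernel to a larger odd length, centered."""
    extra = (target_len - len(kernel)) // 2
    return [0.0] * extra + list(kernel) + [0.0] * extra

# 1) Per-channel parameter counts for some assumed multi-scale kernel sizes.
kernel_sizes = [3, 7, 11]                      # assumed scales, illustration only
square = sum(k * k for k in kernel_sizes)      # parallel k x k branches
cascaded = sum(2 * k for k in kernel_sizes)    # 1 x k branch then k x 1 branch
print(f"k x k params: {square}, cascaded 1xk + kx1 params: {cascaded}")

# 2) Inference-time merging: because convolution is linear, summing the outputs
#    of parallel branches equals convolving once with the (zero-padded) sum of
#    their kernels, i.e. the multi-branch collapses to a single branch.
signal = [0.5, -1.0, 2.0, 0.0, 3.0, -0.5, 1.5]
branch_small = [1.0, 0.0, -1.0]                # small kernel (local detail)
branch_large = [0.2, 0.1, 0.4, 0.1, 0.2]       # large kernel (global context)

multi_branch = [a + b for a, b in
                zip(conv1d(signal, branch_small), conv1d(signal, branch_large))]
merged_kernel = [a + b for a, b in
                zip(pad_kernel(branch_small, 5), branch_large)]
single_branch = conv1d(signal, merged_kernel)

assert all(abs(m - s) < 1e-9 for m, s in zip(multi_branch, single_branch))
```

With these assumed scales the cascaded design needs 42 parameters per channel against 179 for the square kernels, and the branch-merging identity is what lets the paper keep multi-branch accuracy at single-branch inference speed.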

Abstract (translated)

In recent years, vision-based convolutional neural network (CNN) architectures have been successfully applied to audio recognition tasks using Mel-Spectrograms. However, these CNNs have high computational costs and memory requirements, limiting their deployment on low-end edge devices. Motivated by the success of efficient vision models such as InceptionNeXt and ConvNeXt, we propose AudioRepInceptionNeXt, a single-stream architecture. Its basic building block decomposes the parallel multi-branch depth-wise convolutions with descending scales of k x k kernels into a cascade of two multi-branch depth-wise convolutions. The first multi-branch consists of parallel multi-scale 1 x k depth-wise convolutional layers, followed by a similar multi-branch employing parallel multi-scale k x 1 depth-wise convolutional layers. This reduces the computational and memory footprint while separating the time and frequency processing of Mel-Spectrograms. Large kernels capture global frequencies and long activities, while small kernels capture local frequencies and short activities. We also reparameterize the multi-branch design during inference to further boost speed without losing accuracy. Experiments show that AudioRepInceptionNeXt reduces parameters and computations by more than 50% and improves inference speed by 1.28x over state-of-the-art CNNs such as Slow-Fast, while maintaining comparable accuracy. It also learns robustly across a variety of audio recognition tasks. Code is available at this https URL.

URL

https://arxiv.org/abs/2404.13551

PDF

https://arxiv.org/pdf/2404.13551.pdf

