Input Conditioned Layer Dropping in Speech Foundation Models

2025-07-10 17:39:03
Abdul Hannan, Daniele Falavigna, Alessio Brutti

Abstract

Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures featuring adaptable reduction strategies. One emerging approach is layer dropping ($\mathcal{LD}$), which skips a fraction of the layers of a backbone network during inference to reduce the computational load. This allows static models to be transformed into dynamic ones. However, existing approaches are limited either in how they select layers or in that they significantly modify the neural architecture. To this end, we propose input-driven $\mathcal{LD}$, which employs the network's input features and a lightweight layer-selecting network to determine the optimum combination of processing layers. Extensive experimentation on four public speech and audio benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, which thoroughly outperforms random dropping and produces results on par with (or better than) early exit.
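
The idea of input-driven layer dropping can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the selecting network is built or trained, so the linear scoring, the top-k rule, the toy residual layers, and every name below (`make_layer`, `select_layers`, `forward`, `budget`) are assumptions made purely for illustration.

```python
import random


def make_layer(scale):
    """Toy stand-in for a backbone block: a residual update on a feature vector."""
    def layer(x):
        return [v + scale * v for v in x]
    return layer


def select_layers(features, weights, budget):
    """Hypothetical selector: one linear score per layer, computed from the
    input features; the `budget` highest-scoring layers are kept (assumption)."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:budget])
    return [i in keep for i in range(len(scores))]


def forward(x, layers, mask):
    """Run only the layers whose mask entry is True; dropped layers are skipped."""
    for layer, keep in zip(layers, mask):
        if keep:
            x = layer(x)
    return x


rng = random.Random(0)  # fixed seed so the sketch is reproducible
NUM_LAYERS, DIM = 6, 4
layers = [make_layer(0.1 * (i + 1)) for i in range(NUM_LAYERS)]
# Untrained, random selector weights: placeholders for a learned selecting network.
weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_LAYERS)]

x = [0.5, -1.0, 2.0, 0.25]                   # pretend input features
mask = select_layers(x, weights, budget=3)   # keep 3 of 6 layers for this input
y = forward(x, layers, mask)
```

Changing `budget` at inference time trades accuracy for compute without touching the backbone itself. Because the mask is computed from `x`, different inputs can activate different subsets of layers, which is what distinguishes this from random dropping.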

URL

https://arxiv.org/abs/2507.07954

PDF

https://arxiv.org/pdf/2507.07954.pdf

