Abstract
Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures with adaptable reduction strategies. One emerging approach is layer dropping ($\mathcal{LD}$), which skips a fraction of the layers of a backbone network during inference to reduce the computational load, thereby turning static models into dynamic ones. However, existing approaches are limited either in how they select layers or because they significantly modify the neural architecture. To this end, we propose input-driven $\mathcal{LD}$, which uses the network's input features and a lightweight layer-selection network to determine the optimal combination of processing layers. Extensive experiments on four public speech and audio benchmarks, using two different pre-trained foundation models, demonstrate the effectiveness of our approach, which consistently outperforms random dropping and produces results on par with (or better than) early exiting.
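The core mechanism, as described, is a small selector that maps the input features to a subset of backbone layers to execute. A minimal sketch of this idea (all names and the scoring rule are illustrative assumptions, not the authors' implementation, which uses a learned layer-selection network):

```python
# Hypothetical sketch of input-driven layer dropping: a lightweight selector
# scores each backbone layer from the input features, and only the top-k
# layers are executed during the forward pass.

def select_layers(features, num_layers, keep):
    """Toy layer-selection step: derive a per-layer score from the mean input
    feature and return the indices of the `keep` highest-scoring layers,
    in their original order. A real selector would be a small learned network."""
    mean_feat = sum(features) / len(features)
    scores = [mean_feat * (i + 1) % 1.0 for i in range(num_layers)]  # placeholder scoring
    ranked = sorted(range(num_layers), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

def forward(features, layers, keep):
    """Run only the selected subset of layers on the input,
    skipping the rest to reduce compute."""
    active = select_layers(features, len(layers), keep)
    x = features
    for i in active:
        x = layers[i](x)
    return x, active
```

The point of the sketch is that the kept-layer budget (`keep`) can be changed at inference time, so the same backbone serves different compute budgets without retraining a separate static model per budget.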
URL
https://arxiv.org/abs/2507.07954