Abstract
In recent years, complexity compression of neural network (NN)-based speech enhancement (SE) models has attracted increasing attention, especially in scenarios with limited hardware resources or strict latency requirements. The main challenge lies in balancing complexity against performance according to the characteristics of the task. In this paper, we propose an intra-inter set knowledge distillation (KD) framework with time-frequency calibration (I$^2$S-TFCKD) for SE. Unlike previous distillation strategies for SE, the proposed framework fully utilizes the time-frequency differential information of speech while promoting global knowledge flow. First, we propose multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which computes teacher-student similarity calibration weights separately in the time and frequency domains and then cross-weights them, enabling a refined allocation of distillation contributions across layers according to speech characteristics. Second, we construct a collaborative distillation paradigm over intra-set and inter-set correlations. Within each correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation; representative features are then generated from each set through residual fusion, forming a fused feature set that enables inter-set knowledge interaction. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.
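The abstract does not give the exact formulation, so the following is a minimal PyTorch-style sketch of one plausible reading of the dual-stream time-frequency cross-calibration: per-layer teacher-student similarities are computed separately along the time and frequency axes, converted to layer weights, and cross-applied so that time-derived weights scale the frequency-stream loss and vice versa. The feature shapes, the cosine-similarity measure, the softmax weighting, and the cross-weighting direction are all assumptions, not the paper's verified method.

```python
# Hypothetical sketch, not the paper's exact formulation. Assumes student
# features have already been projected to the teacher's (B, C, T, Fq) shape.
import torch
import torch.nn.functional as F


def tf_calibration_weights(t_feats, s_feats, tau=1.0):
    """Per-layer teacher-student similarity weights along time and frequency.

    t_feats, s_feats: lists of aligned feature maps, each (B, C, T, Fq).
    Returns two (num_layers,) weight vectors (softmax over layers).
    """
    time_sims, freq_sims = [], []
    for t, s in zip(t_feats, s_feats):
        # Time stream: pool over frequency, compare the per-frame features.
        t_time = t.mean(dim=3).flatten(1)          # (B, C*T)
        s_time = s.mean(dim=3).flatten(1)
        time_sims.append(F.cosine_similarity(t_time, s_time, dim=1).mean())
        # Frequency stream: pool over time, compare the per-bin features.
        t_freq = t.mean(dim=2).flatten(1)          # (B, C*Fq)
        s_freq = s.mean(dim=2).flatten(1)
        freq_sims.append(F.cosine_similarity(t_freq, s_freq, dim=1).mean())
    time_sims = torch.stack(time_sims)
    freq_sims = torch.stack(freq_sims)
    # Assumed convention: lower similarity -> larger weight, so poorly
    # matched layers receive a larger share of the distillation signal.
    w_time = F.softmax(-time_sims / tau, dim=0)
    w_freq = F.softmax(-freq_sims / tau, dim=0)
    return w_time, w_freq


def cross_calibrated_kd_loss(t_feats, s_feats):
    """Cross-weighting: time-derived weights scale the frequency-stream loss
    and vice versa (one reading of 'cross-calibration')."""
    w_time, w_freq = tf_calibration_weights(t_feats, s_feats)
    loss = 0.0
    for i, (t, s) in enumerate(zip(t_feats, s_feats)):
        l_time = F.mse_loss(s.mean(dim=3), t.mean(dim=3))  # time-stream gap
        l_freq = F.mse_loss(s.mean(dim=2), t.mean(dim=2))  # frequency-stream gap
        loss = loss + w_freq[i] * l_time + w_time[i] * l_freq
    return loss
```

In a training step this term would be added, scaled by a distillation coefficient, to the student's ordinary enhancement loss; the softmax over layers keeps the total distillation budget fixed while redistributing it toward the layers where teacher and student disagree most.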
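Likewise only an illustrative sketch of the intra-set/inter-set paradigm, assuming each correlated set is given as a list of aligned teacher-student feature maps. The residual-fusion form (running accumulation, averaged) and the MSE matching are assumptions; in the actual method the intra-set pairwise losses would presumably be scaled by the calibration weights above.

```python
# Hypothetical sketch of the intra-set / inter-set collaboration; how layers
# are grouped into correlated sets and the fusion form are assumptions.
import torch.nn.functional as F


def residual_fuse(feats):
    """Fuse a correlated set's aligned (B, C, T, Fq) layer features into one
    representative map via residual-style accumulation (assumed form)."""
    fused = feats[0]
    for f in feats[1:]:
        fused = fused + f
    return fused / len(feats)


def intra_inter_set_loss(teacher_sets, student_sets):
    """teacher_sets / student_sets: lists of correlated sets, each a list of
    aligned (B, C, T, Fq) feature maps. Returns (intra, inter) loss terms."""
    intra = 0.0
    t_reps, s_reps = [], []
    for t_set, s_set in zip(teacher_sets, student_sets):
        # Intra-set: pairwise-matched multi-layer distillation within the set.
        for t, s in zip(t_set, s_set):
            intra = intra + F.mse_loss(s, t)
        t_reps.append(residual_fuse(t_set))
        s_reps.append(residual_fuse(s_set))
    # Inter-set: distill between representative features so that knowledge
    # flows across sets rather than only within them.
    inter = sum(F.mse_loss(sr, tr) for sr, tr in zip(s_reps, t_reps))
    return intra, inter
```

The design intent, as described in the abstract, is that pairwise matching handles fine-grained alignment inside each set while the fused representatives give every set a single summary feature through which global, inter-set knowledge interaction can take place.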
URL
https://arxiv.org/abs/2506.13127