Abstract
Sound event detection (SED) has benefited significantly from self-supervised learning (SSL) approaches, particularly the masked audio transformer for SED (MAT-SED), which uses masked block prediction to reconstruct missing audio segments. However, while effective at capturing global dependencies, masked block prediction disrupts transient sound events and does not explicitly enforce temporal order, making it less suitable for fine-grained event boundary detection. To address these limitations, we propose JiTTER (Jigsaw Temporal Transformer for Event Reconstruction), an SSL framework designed to enhance temporal modeling in transformer-based SED. JiTTER introduces a hierarchical temporal shuffle reconstruction strategy in which audio sequences are randomly shuffled at both the block level and the frame level, forcing the model to reconstruct the correct temporal order. This pretraining objective encourages the model to learn both global event structures and fine-grained transient details, improving its ability to detect events with sharp onset-offset characteristics. Additionally, we incorporate noise injection during block shuffle, a subtle perturbation mechanism that further regularizes feature learning and enhances model robustness. Experimental results on the DESED dataset demonstrate that JiTTER outperforms MAT-SED, achieving a 5.89% improvement in PSDS and highlighting the effectiveness of explicit temporal reasoning in SSL-based SED. Our findings suggest that structured temporal reconstruction tasks, rather than simple masked prediction, offer a more effective pretraining paradigm for sound event representation learning.
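The two shuffle levels described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state block sizes, the noise scale, how shuffled regions are selected, or the loss, so the function names (`block_shuffle`, `frame_shuffle`, `reconstruction_loss`), the `block_size` and `noise_std` parameters, the Gaussian noise, and the mean-squared-error objective are all assumptions made for illustration. Frames are plain lists of floats so the example needs only the Python standard library.

```python
import random


def block_shuffle(frames, block_size, rng, noise_std=0.0):
    """Permute contiguous blocks of frames (assumed block-level shuffle).

    If noise_std > 0, zero-mean Gaussian noise is added to every feature.
    Both the noise type and its scale are assumptions, not from the paper.
    """
    blocks = [frames[i:i + block_size] for i in range(0, len(frames), block_size)]
    order = list(range(len(blocks)))
    rng.shuffle(order)
    shuffled = []
    for b in order:
        for frame in blocks[b]:
            if noise_std > 0:
                shuffled.append([x + rng.gauss(0.0, noise_std) for x in frame])
            else:
                shuffled.append(list(frame))
    return shuffled, order


def frame_shuffle(frames, block_size, rng):
    """Permute frames inside each block, keeping the block order
    (assumed frame-level shuffle)."""
    shuffled, index = [], []
    for start in range(0, len(frames), block_size):
        idx = list(range(start, min(start + block_size, len(frames))))
        rng.shuffle(idx)
        index.extend(idx)
        shuffled.extend(list(frames[i]) for i in idx)
    return shuffled, index


def reconstruction_loss(predicted, target):
    """Mean squared error between predicted frames and the original,
    correctly ordered frames (an assumed stand-in objective)."""
    total, count = 0.0, 0
    for p, t in zip(predicted, target):
        for a, b in zip(p, t):
            total += (a - b) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "spectrogram": 12 frames with 2 features each.
    original = [[float(t), 0.5 * t] for t in range(12)]
    coarse, block_order = block_shuffle(original, block_size=4, rng=rng, noise_std=0.05)
    fine, frame_index = frame_shuffle(original, block_size=4, rng=rng)
    # A model would take `coarse` or `fine` as input and be trained to
    # output `original`; here we only measure how far the inputs are from it.
    print("block order:", block_order)
    print("loss of block-shuffled input vs. original:", reconstruction_loss(coarse, original))
    print("loss of frame-shuffled input vs. original:", reconstruction_loss(fine, original))
```

In this sketch the block-level shuffle disturbs long-range order while leaving each block internally intact, and the frame-level shuffle does the opposite. That mirrors the abstract's distinction between global event structure and fine-grained transient detail.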
URL
https://arxiv.org/abs/2502.20857