
JiTTER: Jigsaw Temporal Transformer for Event Reconstruction for Self-Supervised Sound Event Detection


Abstract

Sound event detection (SED) has significantly benefited from self-supervised learning (SSL) approaches, particularly the masked audio transformer for SED (MAT-SED), which leverages masked block prediction to reconstruct missing audio segments. However, while effective in capturing global dependencies, masked block prediction disrupts transient sound events and lacks explicit enforcement of temporal order, making it less suitable for fine-grained event boundary detection. To address these limitations, we propose JiTTER (Jigsaw Temporal Transformer for Event Reconstruction), an SSL framework designed to enhance temporal modeling in transformer-based SED. JiTTER introduces a hierarchical temporal shuffle reconstruction strategy, in which audio sequences are randomly shuffled at both the block level and the frame level, forcing the model to reconstruct the correct temporal order. This pretraining objective encourages the model to learn both global event structures and fine-grained transient details, improving its ability to detect events with sharp onset-offset characteristics. Additionally, we incorporate noise injection during block shuffle, providing a subtle perturbation mechanism that further regularizes feature learning and enhances model robustness. Experimental results on the DESED dataset demonstrate that JiTTER outperforms MAT-SED, achieving a 5.89% improvement in PSDS, highlighting the effectiveness of explicit temporal reasoning in SSL-based SED. Our findings suggest that structured temporal reconstruction tasks, rather than simple masked prediction, offer a more effective pretraining paradigm for sound event representation learning.
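The hierarchical shuffle described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `jigsaw_corrupt`, the parameters `block_size`, `n_block_shuffle`, `n_frame_shuffle`, and `noise_std`, and the choice of additive Gaussian noise on the block-shuffled positions are all assumptions made for illustration. The abstract does not specify block sizes, shuffle rates, or the exact form of the noise.

```python
import numpy as np


def jigsaw_corrupt(x, block_size=8, n_block_shuffle=3, n_frame_shuffle=2,
                   noise_std=0.05, seed=0):
    """Shuffle a (frames, features) array at block level and frame level.

    Hypothetical sketch. Returns (corrupted, order), where order[i] is the
    index of the original frame that now sits at position i. A model would be
    trained to recover x (the correct temporal order) from corrupted.
    """
    rng = np.random.default_rng(seed)
    n_frames = x.shape[0]
    n_blocks = n_frames // block_size
    usable = n_blocks * block_size  # trailing frames stay in place

    order = np.arange(n_frames)
    blocks = order[:usable].reshape(n_blocks, block_size).copy()

    # Block-level shuffle: permute the positions of a random subset of blocks.
    chosen = rng.choice(n_blocks, size=min(n_block_shuffle, n_blocks),
                        replace=False)
    blocks[chosen] = blocks[rng.permutation(chosen)]

    # Frame-level shuffle: permute frames inside a random subset of blocks.
    chosen_f = rng.choice(n_blocks, size=min(n_frame_shuffle, n_blocks),
                          replace=False)
    for b in chosen_f:
        blocks[b] = rng.permutation(blocks[b])

    order[:usable] = blocks.reshape(-1)
    corrupted = x[order].astype(float)

    # Assumed noise injection: Gaussian noise on block-shuffled positions only.
    positions = (chosen[:, None] * block_size + np.arange(block_size)).ravel()
    corrupted[positions] += rng.normal(0.0, noise_std,
                                       size=(positions.size, x.shape[1]))
    return corrupted, order


# Usage: 42 frames of 4-dimensional features (e.g. a toy spectrogram).
features = np.random.default_rng(1).normal(size=(42, 4))
corrupted, order = jigsaw_corrupt(features)
```

In this sketch the reconstruction target is simply the original `features` array, so a pretraining loss could compare the model's output on `corrupted` against `features` frame by frame. The `order` array is returned only to make the corruption inspectable.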

URL

https://arxiv.org/abs/2502.20857

PDF

https://arxiv.org/pdf/2502.20857.pdf

