Refining music sample identification with a self-supervised graph neural network

2025-06-17 16:19:21
Aditya Bhattacharjee, Ivan Meresman Higgs, Mark Sandler, Emmanouil Benetos

Abstract

Automatic sample identification (ASID), the detection and identification of portions of audio recordings that have been reused in new musical works, is an essential but challenging task in the field of audio query-based retrieval. While a related task, audio fingerprinting, has made significant progress in accurately retrieving musical content under "real world" (noisy, reverberant) conditions, ASID systems struggle to identify samples that have undergone musical modifications. Building a system that is robust to common music production transformations, such as time-stretching, pitch-shifting, effects processing, and underlying or overlaid music, therefore remains an important open challenge. In this work, we propose a lightweight and scalable encoding architecture that employs a Graph Neural Network within a contrastive learning framework. Our model uses only 9% of the trainable parameters of the current state-of-the-art system while achieving comparable performance, reaching a mean average precision (mAP) of 44.2%. To enhance retrieval quality, we introduce a two-stage approach: an initial coarse similarity search selects candidates, and a cross-attention classifier then rejects irrelevant matches and refines the ranking of the retrieved candidates, an essential capability absent in prior models. In addition, because queries in real-world applications are often short in duration, we benchmark our system on short queries using new fine-grained annotations for the Sample100 dataset, which we publish as part of this work.
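
The two-stage retrieval idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are made-up 2-D vectors, `toy_scorer` is a hypothetical stand-in for the cross-attention classifier, and the threshold of 0.5 is an arbitrary choice for illustration. The sketch only shows the control flow: a cheap similarity search shortlists candidates, then a second scorer rejects weak matches and re-orders the rest.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def coarse_search(query, index, k):
    """Stage 1: rank all references by cosine similarity and keep the top k."""
    scored = [(ref_id, cosine(query, emb)) for ref_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def rerank(query, candidates, index, scorer, threshold):
    """Stage 2: rescore candidates, drop those below threshold, sort the rest."""
    rescored = [(ref_id, scorer(query, index[ref_id])) for ref_id, _ in candidates]
    kept = [pair for pair in rescored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept

def toy_scorer(query, emb):
    """Hypothetical stand-in for a learned classifier (assumption, not the
    paper's model): maps squared Euclidean distance to a score in (0, 1]."""
    d2 = sum((x - y) ** 2 for x, y in zip(query, emb))
    return math.exp(-d2)

# Made-up reference embeddings; real ones would come from a trained encoder.
index = {
    "track_a": [1.0, 0.0],
    "track_b": [0.9, 0.1],
    "track_c": [0.0, 1.0],
    "track_d": [3.0, 0.0],  # same direction as the query, very different scale
}
query = [1.0, 0.0]

coarse = coarse_search(query, index, k=3)
final = rerank(query, coarse, index, toy_scorer, threshold=0.5)
print([ref_id for ref_id, _ in coarse])
print([ref_id for ref_id, _ in final])
```

In this toy data, `track_d` passes the coarse stage because cosine similarity ignores scale, and the second stage rejects it. A single-stage nearest-neighbour search has no equivalent way to discard a candidate.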

URL

https://arxiv.org/abs/2506.14684

PDF

https://arxiv.org/pdf/2506.14684.pdf

