Paper Reading AI Learner

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

2025-01-02 07:08:29
Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Wei Tan, Xie Chen

Abstract

Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) on a variety of music informatics understanding tasks, including music tagging, instrument classification, and key detection. In this paper, we propose a self-supervised music representation learning model for music understanding. Unlike previous studies that adopt random projection or an existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Mel-RVQ uses a residual linear projection structure to quantize the Mel spectrum, which improves the stability and efficiency of target extraction and leads to better performance. Experiments on a wide variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models using only 0.9K hours of open-source pre-training data. Scaling the data up to over 160K hours and adopting iterative training consistently improves model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model trained with contrastive learning, which achieves state-of-the-art performance on the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in this https URL.
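The abstract describes Mel-RVQ as residual vector quantization applied to Mel-spectrogram frames. The following is a minimal illustrative sketch of generic residual vector quantization over frame vectors; the codebook sizes, number of stages, and plain nearest-neighbor lookup are assumptions for illustration, not the paper's actual Mel-RVQ configuration (which additionally uses learned linear projections).

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(frames, codebooks):
    """Quantize each frame with a stack of codebooks: each stage quantizes
    the residual left over by the previous stage and emits one token
    (codeword index) per frame per stage."""
    residual = frames.copy()
    tokens = []
    for cb in codebooks:
        # nearest codeword for each frame under Euclidean distance
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        tokens.append(idx)
        residual = residual - cb[idx]  # pass the residual on to the next stage
    return np.stack(tokens, axis=1), residual

# toy data: 8 "Mel frames" of dimension 16, quantized by 2 codebooks of 32 entries
frames = rng.normal(size=(8, 16))
codebooks = [rng.normal(size=(32, 16)) for _ in range(2)]

tokens, residual = rvq_encode(frames, codebooks)
print(tokens.shape)  # one token per frame per quantization stage: (8, 2)
```

In an SSL setup like the one described, these discrete tokens would serve as the prediction targets for the pre-trained encoder.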

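MuQ-MuLan is described as a joint music-text embedding model based on contrastive learning. Below is a hedged sketch of a standard symmetric InfoNCE-style contrastive objective of the kind such models typically use; the temperature value, embedding dimensions, and toy data are illustrative assumptions, not details from the paper.

```python
import numpy as np

def contrastive_loss(music_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched (music, text) pairs sit on the
    diagonal of the similarity matrix and are pulled together, all other
    pairs in the batch act as negatives."""
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature          # cosine similarities, scaled
    labels = np.arange(len(logits))

    def xent(lg):
        # row-wise cross-entropy against the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the music-to-text and text-to-music directions
    return (xent(logits) + xent(logits.T)) / 2

# toy batch of 4 paired clips/captions (text embeddings near their music pair)
rng = np.random.default_rng(0)
music = rng.normal(size=(4, 8))
text = music + 0.1 * rng.normal(size=(4, 8))
loss = contrastive_loss(music, text)
print(float(loss))
```

Zero-shot tagging with such a model amounts to embedding each candidate tag's text and picking the tags whose embeddings are most similar to the music embedding.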

URL

https://arxiv.org/abs/2501.01108

PDF

https://arxiv.org/pdf/2501.01108.pdf

