
Musical Word Embedding for Music Tagging and Retrieval

2024-04-21 08:19:20
SeungHeon Doh, Jongpil Lee, Dasaem Jeong, Juhan Nam

Abstract

Word embedding has become an essential means for text-based information retrieval. Typically, word embeddings are learned from large quantities of general, unstructured text. In the music domain, however, such embeddings may fail to capture musical context or to recognize music-related entities such as artists and tracks. To address this issue, we propose Musical Word Embedding (MWE), which learns from a range of text sources spanning both general and music-specific vocabulary. We integrate MWE into an audio-word joint representation framework for music tagging and retrieval, using words of different levels of musical specificity: tags, artists, and tracks. Our experiments show that training with a more specific word type such as track yields better retrieval performance, while a less specific type such as tag yields better tagging performance. To balance this trade-off, we propose multi-prototype training that jointly uses words of different levels of musical specificity. We evaluate both the word embedding and the audio-word joint embedding on four tasks (tag rank prediction, music tagging, query-by-tag, and query-by-track) across two datasets (Million Song Dataset and MTG-Jamendo). Our results show that the proposed MWE is more efficient and robust than conventional word embeddings.
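The abstract describes an audio-word joint representation: audio features and word embeddings are projected into a shared space where a track's audio should sit close to its matching words (tags, artists, track names). The sketch below illustrates that idea with a cosine-similarity triplet loss over randomly initialised linear projections; all dimensions, names, and the choice of triplet loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: audio feature size, word embedding size, shared space size.
AUDIO_DIM, WORD_DIM, JOINT_DIM = 128, 300, 64

# Linear projections into the shared audio-word space (randomly initialised;
# in practice these would be learned by minimising the loss below).
W_audio = rng.normal(scale=0.1, size=(AUDIO_DIM, JOINT_DIM))
W_word = rng.normal(scale=0.1, size=(WORD_DIM, JOINT_DIM))

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def joint_similarity(audio_feat, word_vec):
    """Cosine similarity between a track's audio and a word in the joint space."""
    a = l2_normalize(audio_feat @ W_audio)
    w = l2_normalize(word_vec @ W_word)
    return float(a @ w)

def triplet_loss(audio_feat, pos_word, neg_word, margin=0.2):
    """Hinge loss pulling a matching word (e.g. the track's tag) closer to the
    audio than a non-matching word, by at least `margin` in cosine similarity."""
    return max(0.0, margin
               - joint_similarity(audio_feat, pos_word)
               + joint_similarity(audio_feat, neg_word))

# Toy example: one audio clip, one matching and one non-matching word vector.
audio = rng.normal(size=AUDIO_DIM)
pos, neg = rng.normal(size=WORD_DIM), rng.normal(size=WORD_DIM)
loss = triplet_loss(audio, pos, neg)
```

Under this view, the paper's multi-prototype training would correspond to drawing the positive word from several vocabularies of different musical specificity (tag, artist, track) for the same audio clip, rather than from a single one.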

URL

https://arxiv.org/abs/2404.13569

PDF

https://arxiv.org/pdf/2404.13569.pdf

