Abstract
Word embedding has become an essential tool for text-based information retrieval. Typically, word embeddings are learned from large quantities of general, unstructured text. In the music domain, however, such embeddings may have difficulty understanding musical contexts or recognizing music-related entities such as artists and tracks. To address this issue, we propose Musical Word Embedding (MWE), an approach that learns from various types of text covering both everyday and music-related vocabulary. We integrate MWE into an audio-word joint representation framework for music tagging and retrieval, using words like tag, artist, and track that have different levels of musical specificity. Our experiments show that using a more specific musical word like track yields better retrieval performance, while using a less specific word like tag yields better tagging performance. To balance this trade-off, we suggest multi-prototype training, which jointly uses words with different levels of musical specificity. We evaluate both the word embedding and the audio-word joint embedding on four tasks (tag rank prediction, music tagging, query-by-tag, and query-by-track) across two datasets (Million Song Dataset and MTG-Jamendo). Our findings show that the proposed MWE is more efficient and robust than conventional word embedding.
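The idea of multi-prototype training can be illustrated with a minimal sketch. The abstract does not specify the training objective, so everything here is an assumption made for illustration: the cosine-similarity hinge (triplet) loss, the margin of 0.4, the function names, and the toy 2-D vectors are all hypothetical and are not taken from the paper. The sketch only shows how one audio embedding could be compared against tag, artist, and track word embeddings at the same time.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(audio, positive, negative, margin=0.4):
    """Hinge loss on cosine similarity (assumed objective, not the paper's)."""
    return max(0.0, margin - cosine(audio, positive) + cosine(audio, negative))


def multi_prototype_loss(audio, positives, negatives, margin=0.4):
    """Sum the hinge loss over every word type (tag, artist, track)."""
    return sum(
        triplet_loss(audio, positives[kind], negatives[kind], margin)
        for kind in positives
    )


# Toy, made-up 2-D vectors standing in for learned embeddings.
audio = [1.0, 0.0]
positives = {"tag": [1.0, 0.0], "artist": [1.0, 1.0], "track": [0.0, 1.0]}
negatives = {"tag": [0.0, 1.0], "artist": [-1.0, 0.0], "track": [1.0, 0.0]}

loss = multi_prototype_loss(audio, positives, negatives)
print(f"multi-prototype loss: {loss:.3f}")
```

In this toy setup only the track term contributes to the loss, because the audio vector already sits closer to its tag and artist prototypes than to the corresponding negatives. Summing over word types is one simple way to combine them; a weighted sum would be an equally plausible choice.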
URL
https://arxiv.org/abs/2404.13569