Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

2024-04-10 02:02:51
Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

Abstract

To achieve a flexible music recommendation and retrieval system, it is desirable to calculate music similarity by focusing on multiple partial elements of musical pieces and to allow users to select the element they want to focus on. A previous study proposed using multiple individual networks to calculate music similarity based on each instrumental sound, but it is impractical to use each isolated signal as a query in a search system. Using separated instrumental sounds instead resulted in lower accuracy due to separation artifacts. In this paper, we propose a method that computes similarities focusing on each instrumental sound with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks and trained with a triplet loss using masks. Experimental results show that (1) the proposed method obtains more accurate feature representations than individual networks that use separated sounds as input, (2) each sub-embedding space preserves the characteristics of the corresponding instrument, and (3) the similar musical pieces selected by the proposed method when focusing on each instrumental sound agree with human judgments, especially for drums and guitar.
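
Although only the abstract is available here, the training scheme it describes (a single embedding of the mixed sound, partitioned into per-instrument sub-spaces by masks and trained with a triplet loss) can be sketched in a few lines. The PyTorch code below is a minimal illustration, not the authors' implementation; the encoder backbone, the embedding size, the number of instruments, and the equal-block partition of dimensions into fixed binary masks are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalSimilarityNet(nn.Module):
    """Maps a mixed-sound input to one shared embedding whose dimensions
    are split into disjoint per-instrument sub-spaces via fixed masks
    (hypothetical sketch of a Conditional Similarity Network)."""

    def __init__(self, encoder, embed_dim=256, num_instruments=4):
        super().__init__()
        self.encoder = encoder  # any module mapping input -> (batch, embed_dim)
        # One binary mask per instrument, each selecting an equal,
        # disjoint block of embedding dimensions (an assumption here).
        masks = torch.zeros(num_instruments, embed_dim)
        block = embed_dim // num_instruments
        for i in range(num_instruments):
            masks[i, i * block:(i + 1) * block] = 1.0
        self.register_buffer("masks", masks)

    def forward(self, x, instrument_id):
        z = self.encoder(x)                   # (batch, embed_dim)
        return z * self.masks[instrument_id]  # keep only that sub-space

def masked_triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss applied to the masked embeddings."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

# Usage: anchor/positive/negative are mixed sounds compared under the
# same instrument condition (e.g. 0 = drums); shapes are illustrative.
net = ConditionalSimilarityNet(
    encoder=nn.Sequential(nn.Flatten(), nn.Linear(128 * 64, 256)))
a = net(torch.randn(8, 1, 128, 64), instrument_id=0)
p = net(torch.randn(8, 1, 128, 64), instrument_id=0)
n = net(torch.randn(8, 1, 128, 64), instrument_id=0)
loss = masked_triplet_loss(a, p, n)
```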

URL

https://arxiv.org/abs/2404.06682

PDF

https://arxiv.org/pdf/2404.06682.pdf

