A Survey on Cross-Modal Interaction Between Music and Multimodal Data

2025-04-17 09:58:38
Sifei Li, Mining Tan, Feier Shen, Minyan Luo, Zijiao Yin, Fan Tang, Weiming Dong, Changsheng Xu

Abstract

Multimodal learning has driven innovation across various industries, particularly in the field of music. By enabling more intuitive interaction experiences and enhancing immersion, it not only lowers the entry barriers to music but also increases its overall appeal. This survey aims to provide a comprehensive review of multimodal tasks related to music, outlining how music contributes to multimodal learning and offering insights for researchers seeking to expand the boundaries of computational music. Unlike text and images, which are often semantically or visually intuitive, music interacts with humans primarily through auditory perception, making its data representation inherently less intuitive. Therefore, this paper first introduces the representations of music and provides an overview of music datasets. Subsequently, we categorize cross-modal interactions between music and multimodal data into three types: music-driven cross-modal interactions, music-oriented cross-modal interactions, and bidirectional music cross-modal interactions. For each category, we systematically trace the development of relevant sub-tasks, analyze existing limitations, and discuss emerging trends. Furthermore, we provide a comprehensive summary of the datasets and evaluation metrics used in multimodal tasks related to music, offering benchmark references for future research. Finally, we discuss the current challenges in cross-modal interactions involving music and propose potential directions for future research.
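
To make the point about music's less intuitive data representation concrete, the following minimal Python sketch (not taken from the paper; the use of pretty_midi and librosa and the file names are assumptions for illustration only) loads a piece of music in two representations commonly used in the literature: a symbolic view of discrete MIDI note events and an audio view as a log-mel spectrogram.

```python
# A minimal sketch, not from the paper, contrasting two common music
# representations: symbolic note events (MIDI) and an audio-derived
# log-mel spectrogram. File names are hypothetical placeholders.
import numpy as np
import pretty_midi
import librosa

# Symbolic representation: discrete, human-readable note events
# (pitch, onset time, duration) parsed from a MIDI file.
midi = pretty_midi.PrettyMIDI("example.mid")  # hypothetical input file
notes = [(note.pitch, note.start, note.end - note.start)
         for inst in midi.instruments
         for note in inst.notes]
print(f"symbolic view: {len(notes)} note events")

# Audio representation: a continuous waveform converted to a log-mel
# spectrogram, the dense feature most cross-modal models consume.
y, sr = librosa.load("example.wav", sr=22050)  # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)
print("audio view: log-mel spectrogram of shape", log_mel.shape)
```

The symbolic view exposes musical structure directly, while the spectrogram view is what most audio-side encoders in cross-modal systems operate on; neither is as immediately interpretable as text or images, which is the motivation for surveying music representations first.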

URL

https://arxiv.org/abs/2504.12796

PDF

https://arxiv.org/pdf/2504.12796.pdf

