Paper Reading AI Learner

Efficient Neural Music Generation

2023-05-25 05:02:35
Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang

Abstract

Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet sampling with MusicLM requires passing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% for sampling 10s or 30s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into a waveform. DPD is proposed to model the coarse and fine acoustics simultaneously by incorporating the semantic information into segments of latents via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages in sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at this https URL.
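
The abstract says that DPD injects semantic information into segments of latents via cross-attention at each denoising step. The following is a minimal pure-Python sketch of that idea only, not the authors' implementation: the function names (`cross_attention`, `denoise_step`), the toy dimensions, and the update rule are all hypothetical, and the learned projections and trained denoising network of a real model are omitted.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(latents, semantic):
    """Each latent vector (query) attends over the semantic vectors (keys and values).

    Learned query/key/value projections are left out to keep the sketch short.
    """
    dim = len(semantic[0])
    attended = []
    for query in latents:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in semantic
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, semantic)) for i in range(dim)]
        )
    return attended

def denoise_step(latents, semantic, step_size=0.1):
    """Hypothetical placeholder update: nudge each latent toward its semantic context.

    A real diffusion model would instead apply a trained network here.
    """
    context = cross_attention(latents, semantic)
    return [
        [x + step_size * (c - x) for x, c in zip(latent, ctx)]
        for latent, ctx in zip(latents, context)
    ]

random.seed(0)
# Toy stand-ins: 3 semantic token embeddings and a segment of 5 noisy latents, all of size 4.
semantic_tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
latents = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]

for _ in range(8):  # the semantic conditioning is re-applied at every step
    latents = denoise_step(latents, semantic_tokens)
```

The point of the sketch is the loop structure: the same semantic tokens condition the latents again at every denoising step, rather than being consumed once up front.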


URL

https://arxiv.org/abs/2305.15719

PDF

https://arxiv.org/pdf/2305.15719.pdf

