Abstract
Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet sampling with MusicLM requires running these LMs one after another to obtain the fine-grained acoustic tokens, which makes it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes required by MusicLM by 95.7% or 99.6% when sampling 10s or 30s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveforms. DPD models the coarse and fine acoustics simultaneously by incorporating the semantic information into segments of latents via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages in sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at this https URL.
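To put the reported reductions in perspective, the short sketch below converts a percentage reduction in forward passes into a fold factor. Only the two percentages come from the abstract; the function name fold_reduction is a hypothetical helper introduced for illustration, and the result counts forward passes only, which need not equal wall-clock speedup.

```python
def fold_reduction(percent_reduced: float) -> float:
    """Return how many times fewer forward passes remain after a percentage cut.

    Hypothetical helper for illustration; not part of MeLoDy or MusicLM.
    """
    if not 0.0 <= percent_reduced < 100.0:
        raise ValueError("percent_reduced must be in [0, 100)")
    # Fraction of the baseline forward passes that is still needed.
    remaining_fraction = 1.0 - percent_reduced / 100.0
    return 1.0 / remaining_fraction


# The two (clip length in seconds, percent reduction) pairs quoted in the abstract.
for clip_seconds, percent in [(10, 95.7), (30, 99.6)]:
    factor = fold_reduction(percent)
    print(f"{clip_seconds}s clip: {percent}% fewer passes, about {factor:.1f}x fewer")
```

Under this reading, the 95.7% figure corresponds to roughly 23 times fewer forward passes and the 99.6% figure to roughly 250 times fewer, so the relative saving grows with clip length.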
URL
https://arxiv.org/abs/2305.15719