Abstract
Singing voice beautifying is a novel task with practical value in daily life: it aims to correct the pitch of a singing voice and improve its expressiveness without changing the original timbre or content. Existing methods either rely on paired data or concentrate only on pitch correction. However, professional and amateur recordings from the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects such as emotion and rhythm. We therefore propose ConTuner, a fast and high-fidelity singing voice beautifying system. ConTuner is a diffusion model that generates the beautified Mel-spectrogram from a modified condition composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping from MIDI and the spectral envelope to pitch. To make amateur singing more expressive, we propose an expressiveness enhancer that operates in the latent space and converts amateur vocal tone to professional vocal tone. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective.
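As background for the pitch-correction step, the sketch below shows the standard equal-temperament relation between a MIDI note number and its frequency, and how far a sung pitch deviates from a score note in cents. This is not ConTuner's mapping, which the abstract describes as a relationship from MIDI and spectral envelope to pitch. The function names and the 440 Hz reference tuning are assumptions chosen for illustration.

```python
import math

A4_HZ = 440.0   # reference tuning for A4 (an assumption; other tunings exist)
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note: float) -> float:
    """Equal-temperament frequency in Hz of a MIDI note number."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def hz_to_midi(freq_hz: float) -> float:
    """Fractional MIDI note number of a frequency in Hz."""
    return A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ)


def cents_off(sung_hz: float, target_note: int) -> float:
    """Signed deviation of a sung pitch from the score note, in cents.

    Positive means sharp, negative means flat; 100 cents is one semitone.
    """
    return 100.0 * (hz_to_midi(sung_hz) - target_note)
```

A pitch corrector can use such a deviation as a simple measure of how far an amateur rendition drifts from the score. A learned mapping like the one the abstract describes would additionally have to account for natural pitch movement such as vibrato and note transitions.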
URL
https://arxiv.org/abs/2404.19187