CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition

2024-04-30 01:27:12
Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

Abstract

Singing voice beautifying is a novel task with practical value in daily life. It aims to correct the pitch of a singing voice and improve its expressiveness without changing the original timbre and content. Existing methods either rely on paired data or concentrate only on pitch correction. However, professional and amateur recordings of the same song by the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects such as emotion and rhythm. We therefore propose a fast and high-fidelity singing voice beautifying system called ConTuner, a diffusion model that is combined with a modified condition to generate the beautified Mel-spectrogram, where the modified condition is composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping from MIDI and spectrum envelope to pitch. To make amateur singing more expressive, we propose an expressiveness enhancer that operates in the latent space to convert amateur vocal tone to professional vocal tone. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective.
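
To make the pitch-correction idea concrete, here is a minimal rule-based sketch. It is not ConTuner's method: the abstract describes a learned mapping from MIDI and spectrum envelope to pitch, whereas this example only pulls each voiced F0 frame toward the score note using the standard equal-temperament relation between MIDI note number and frequency. The function names, the `strength` parameter, the A4 = 440 Hz tuning, and the convention that an F0 of 0 marks an unvoiced frame are all assumptions made for illustration.

```python
import math

A4_HZ = 440.0   # assumed reference tuning
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note: float) -> float:
    """Equal-temperament frequency of a (possibly fractional) MIDI note."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def hz_to_midi(freq_hz: float) -> float:
    """Inverse of midi_to_hz; freq_hz must be positive."""
    return A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ)


def snap_f0_to_score(f0_hz, score_midi, strength=1.0):
    """Pull each voiced frame toward its score note (hypothetical helper).

    f0_hz      : per-frame F0 in Hz, with 0 marking unvoiced frames (assumed)
    score_midi : per-frame target MIDI note numbers from the score
    strength   : 0 keeps the sung pitch, 1 lands exactly on the score note
    """
    corrected = []
    for f0, target in zip(f0_hz, score_midi):
        if f0 <= 0:
            corrected.append(0.0)  # leave unvoiced frames untouched
            continue
        current = hz_to_midi(f0)
        shifted = current + strength * (target - current)
        corrected.append(midi_to_hz(shifted))
    return corrected


if __name__ == "__main__":
    # A slightly flat A4 (430 Hz) is moved halfway toward 440 Hz.
    print(snap_f0_to_score([0.0, 430.0], [69, 69], strength=0.5))
```

Hard snapping of this kind flattens vibrato and note transitions, which is one reason a learned pitch predictor conditioned on more than the score, as the abstract describes, is attractive.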

URL

https://arxiv.org/abs/2404.19187

PDF

https://arxiv.org/pdf/2404.19187.pdf

