Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion

2025-06-04 14:42:12
Seymanur Akti, Tuan Nam Nguyen, Alexander Waibel

Abstract

Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve upon a self-supervised, non-autoregressive framework with a conditional variational autoencoder, focusing on reducing source timbre leakage and improving linguistic-acoustic disentanglement for better style transfer. To minimize style leakage, we use multilingual discrete speech units for content representation and reinforce embeddings with an augmentation-based similarity loss and mix-style layer normalization. To enhance expressivity transfer, we incorporate local F0 information via cross-attention and extract style embeddings enriched with global pitch and energy features. Experiments show that our model outperforms baselines in emotion and speaker similarity, demonstrating superior style adaptation and reduced source style leakage.
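The abstract names mix-style layer normalization but does not give its formula. The block below is a minimal sketch of one common formulation from prior style-transfer work, not the authors' implementation: features are layer-normalized, then scaled and shifted by a convex mix of the parameters of two styles. All function and parameter names are hypothetical, and the scale and shift vectors are passed in directly, whereas a real model would presumably predict them from style embeddings.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sample_lambda(alpha=0.2):
    """Draw a mixing weight from Beta(alpha, alpha); alpha is an assumed value."""
    return random.betavariate(alpha, alpha)


def mix_style_layer_norm(x, gamma, beta, gamma_other, beta_other, lam):
    """Apply layer norm, then a style-conditioned scale and shift.

    gamma/beta stand for the scale and shift of the utterance's own style;
    gamma_other/beta_other stand for those of another utterance in the batch.
    lam in [0, 1] controls how much of the own style is kept.
    """
    mixed_gamma = [lam * g + (1 - lam) * go for g, go in zip(gamma, gamma_other)]
    mixed_beta = [lam * b + (1 - lam) * bo for b, bo in zip(beta, beta_other)]
    normalized = layer_norm(x)
    return [g * h + b for g, h, b in zip(mixed_gamma, normalized, mixed_beta)]
```

With `lam = 1.0` the layer reduces to ordinary style-conditioned layer normalization. During training, `lam` would typically be drawn with `sample_lambda()` so that the content features are perturbed with mismatched style statistics.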

URL

https://arxiv.org/abs/2506.04013

PDF

https://arxiv.org/pdf/2506.04013.pdf

