Audio-to-Audio Emotion Conversion With Pitch And Duration Style Transfer

2025-05-23 09:18:12
Soumya Dutta, Avni Jain, Sriram Ganapathy

Abstract

Given a pair of source and reference speech recordings, audio-to-audio (A2A) style transfer involves generating output speech that mimics the style characteristics of the reference while preserving the content and speaker attributes of the source. In this paper, we propose a novel framework, termed A2A Zero-shot Emotion Style Transfer (A2A-ZEST), that transfers the emotional attributes of the reference to the source while retaining the source's speaker identity and speech content. The A2A-ZEST framework consists of an analysis-synthesis pipeline, in which the analysis module decomposes speech into semantic tokens, speaker representations, and emotion embeddings. Using these representations, a pitch contour estimator and a duration predictor are learned. Further, a synthesis module is designed to generate speech from the input representations and the derived factors. The entire analysis-synthesis pipeline is trained purely in a self-supervised manner with an auto-encoding loss. For A2A emotion style transfer, the emotion embedding extracted from the reference speech, together with the remaining representations from the source speech, is fed to the synthesis module to generate the style-transferred speech. In our experiments, we evaluate the converted speech on content and speaker preservation (w.r.t. the source) as well as on the effectiveness of the emotion style transfer (w.r.t. the reference). A2A-ZEST is shown to improve over prior work on these evaluations, thereby enabling style transfer without any parallel training data. We also illustrate the application of the proposed work to data augmentation in emotion recognition tasks.
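
The recombination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `SpeechFactors` container, its field names, the `factors_for_transfer` helper, and the toy values are all hypothetical, and the analysis and synthesis models themselves are left out. The sketch only shows which representation comes from which recording at conversion time.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class SpeechFactors:
    """Hypothetical container for the three outputs of the analysis module."""
    semantic_tokens: Tuple[int, ...]   # content
    speaker: Tuple[float, ...]         # speaker representation
    emotion: Tuple[float, ...]         # emotion embedding


def factors_for_transfer(source: SpeechFactors,
                         reference: SpeechFactors) -> SpeechFactors:
    """Keep content and speaker from the source; take emotion from the reference."""
    return replace(source, emotion=reference.emotion)


# Toy values standing in for real analysis outputs (assumed, not from the paper).
src = SpeechFactors(semantic_tokens=(12, 7, 7, 3), speaker=(0.1, 0.9), emotion=(0.0, 1.0))
ref = SpeechFactors(semantic_tokens=(5, 5, 8), speaker=(0.8, 0.2), emotion=(1.0, 0.0))

mixed = factors_for_transfer(src, ref)
```

When source and reference are the same utterance, the helper returns the factors unchanged, which corresponds to the auto-encoding setup used for training; at conversion time, `mixed` is what the pitch, duration, and synthesis stages would condition on.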

URL

https://arxiv.org/abs/2505.17655

PDF

https://arxiv.org/pdf/2505.17655.pdf

