Abstract
Given a pair of source and reference speech recordings, audio-to-audio (A2A) style transfer is the task of generating output speech that mimics the style characteristics of the reference while preserving the content and speaker attributes of the source. In this paper, we propose a novel framework, termed A2A Zero-shot Emotion Style Transfer (A2A-ZEST), that transfers the reference's emotional attributes to the source while retaining the source's speaker identity and speech content. The A2A-ZEST framework consists of an analysis-synthesis pipeline, where the analysis module decomposes speech into semantic tokens, speaker representations, and emotion embeddings. Using these representations, a pitch contour estimator and a duration predictor are learned. Further, a synthesis module is designed to generate speech from the input representations and the derived factors. This entire analysis-synthesis paradigm is trained purely in a self-supervised manner with an auto-encoding loss. For A2A emotion style transfer, the emotion embedding extracted from the reference speech, along with the remaining representations from the source speech, is fed to the synthesis module to generate the style-transferred speech. In our experiments, we evaluate the converted speech on content/speaker preservation (w.r.t. the source) as well as on the effectiveness of the emotion style transfer (w.r.t. the reference). The proposed A2A-ZEST is shown to improve over prior work on these evaluations, thereby enabling style transfer without any parallel training data. We also illustrate the application of the proposed work to data augmentation for emotion recognition tasks.
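The core transfer mechanism described above can be sketched as follows. This is a purely illustrative outline, not the paper's implementation: the function names, placeholder analysis logic, and data types are hypothetical stand-ins for the actual self-supervised modules (semantic tokenizer, speaker encoder, emotion encoder, and synthesizer).

```python
# Hypothetical sketch of the A2A-ZEST analysis-synthesis swap.
# All names and placeholder computations are illustrative assumptions,
# not the paper's actual models or API.

def analyze(speech):
    """Decompose speech into (semantic tokens, speaker embedding, emotion
    embedding). Placeholders stand in for the learned encoders."""
    semantic_tokens = [hash(frame) % 100 for frame in speech]  # content factor
    speaker_emb = sum(len(frame) for frame in speech) / max(len(speech), 1)
    emotion_emb = speech[0] if speech else None                # style factor
    return semantic_tokens, speaker_emb, emotion_emb

def synthesize(semantic_tokens, speaker_emb, emotion_emb):
    """Generate speech from the factored representations. Here we simply
    return a record showing which factors condition the output; the real
    module also uses the predicted pitch contour and durations."""
    return {"content": semantic_tokens,
            "speaker": speaker_emb,
            "emotion": emotion_emb}

def emotion_style_transfer(source, reference):
    """Emotion embedding comes from the reference; content (semantic
    tokens) and speaker embedding come from the source."""
    src_tokens, src_speaker, _ = analyze(source)
    _, _, ref_emotion = analyze(reference)
    return synthesize(src_tokens, src_speaker, ref_emotion)
```

At training time, by contrast, all three factors come from the same utterance and the synthesizer is trained to reconstruct it (the auto-encoding loss); the swap above is applied only at inference, which is what makes the transfer zero-shot.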
URL
https://arxiv.org/abs/2505.17655