Abstract
Generative spoken language models produce speech in a wide range of voices, prosody, and recording conditions, seemingly approaching the diversity of natural speech. However, the extent to which generated speech is acoustically diverse remains unclear due to a lack of appropriate metrics. We address this gap by developing lightweight metrics of acoustic diversity, which we collectively refer to as MAD Speech. We focus on measuring five facets of acoustic diversity: voice, gender, emotion, accent, and background noise. We construct the metrics as a composition of specialized, per-facet embedding models and an aggregation function that measures diversity within the embedding space. Next, we build a series of datasets with a priori known diversity preferences for each facet. Using these datasets, we demonstrate that our proposed metrics achieve a stronger agreement with the ground-truth diversity than baselines. Finally, we showcase the applicability of our proposed metrics across several real-life evaluation scenarios. MAD Speech will be made publicly accessible.
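The abstract describes each metric as a facet-specific embedding model composed with an aggregation function over the embedding space. As a minimal illustrative sketch (not the paper's exact aggregation), one simple choice of aggregator is average pairwise cosine dissimilarity over utterance embeddings, where embeddings would come from a per-facet encoder (voice, emotion, accent, etc.):

```python
import numpy as np

def diversity_score(embeddings: np.ndarray) -> float:
    """Average pairwise cosine dissimilarity over a set of embeddings.

    `embeddings` is an (n, d) array with one row per utterance, assumed to
    come from a facet-specific encoder. Higher scores mean more diversity
    along that facet. This aggregator is an illustrative assumption, not
    necessarily the one used by MAD Speech.
    """
    # L2-normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = embeddings.shape[0]
    # Average similarity over distinct pairs only (mask out the diagonal).
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())
```

Under this sketch, a set of identical embeddings scores 0 (no diversity) and a set of mutually orthogonal embeddings scores 1.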
URL
https://arxiv.org/abs/2404.10419