Abstract
Methods that generate synthetic speech perceptually indistinguishable from speech recorded by a human speaker are easily available. Several incidents report misuse of synthetic speech generated by these methods to commit fraud. To counter such misuse, many methods have been proposed to detect synthetic speech. Some of these detectors are more interpretable, generalize to detecting synthetic speech in the wild, and are robust to noise. However, limited work has been done on understanding bias in these detectors. In this work, we examine bias in existing synthetic speech detectors to determine whether they unfairly target a particular gender, age, or accent group. We also inspect whether these detectors have a higher misclassification rate for bona fide speech from speech-impaired speakers relative to fluent speakers. Extensive experiments on 6 existing synthetic speech detectors using more than 0.9 million speech signals demonstrate that most detectors are gender, age, and accent biased, and future work is needed to ensure fairness. To support future research, we release our evaluation dataset, models used in our study, and source code at this https URL.
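The bias analysis described above boils down to comparing a detector's misclassification rate on bona fide speech across demographic groups. Below is a minimal sketch (not the authors' released code) of such a per-group comparison; the group labels, record format, and `per_group_fpr` helper are illustrative assumptions.

```python
# Minimal sketch: per-group misclassification rate of bona fide speech.
# Group labels, record layout, and the detector outputs are hypothetical.
from collections import defaultdict

def per_group_fpr(records):
    """records: iterable of (group, is_bona_fide, predicted_synthetic)."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for group, is_bona_fide, predicted_synthetic in records:
        if is_bona_fide:                 # only bona fide speech is considered
            totals[group] += 1
            if predicted_synthetic:      # bona fide speech flagged as synthetic
                errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

# Illustrative example: a gap between the per-group rates would indicate
# that the detector unfairly targets one group's bona fide speech.
records = [
    ("male", True, False), ("male", True, False), ("male", True, True),
    ("female", True, True), ("female", True, True), ("female", True, False),
]
print(per_group_fpr(records))  # -> male ~ 0.33, female ~ 0.67
```

In this toy example the detector flags bona fide speech from one group about twice as often as from the other, which is the kind of disparity the study measures across gender, age, and accent groups.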
URL
https://arxiv.org/abs/2404.10989