Retrieval-Augmented Audio Deepfake Detection

2024-04-22 05:46:40
Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang


Recent advances in speech synthesis, including text-to-speech (TTS) and voice conversion (VC) systems, enable the generation of ultra-realistic audio deepfakes, raising growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, which leads to performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments each test sample with similar retrieved samples to enhance detection. We also extend the multi-fusion attentive classifier to integrate it with the proposed RAD framework. Extensive experiments show that the RAD framework outperforms baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently retrieves samples that come mostly from the same speaker and whose acoustic characteristics are highly consistent with the query audio, thereby improving detection performance.
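The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes each audio clip has already been mapped to a fixed-size embedding vector and that similarity is measured by cosine similarity; the abstract specifies neither. The function names `retrieve_top_k` and `augment` are hypothetical, and the sketch stops before classification, so it does not represent the multi-fusion attentive classifier.

```python
import numpy as np

def retrieve_top_k(query, bank, k=3):
    """Return indices and cosine similarities of the k bank vectors closest to query.

    query: (dim,) embedding of the test clip (assumed precomputed).
    bank:  (n, dim) embeddings of stored reference clips (assumed precomputed).
    """
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                      # cosine similarity to every bank entry
    k = min(k, len(bank))
    top = np.argsort(-sims)[:k]       # indices sorted by descending similarity
    return top, sims[top]

def augment(query, bank, k=3):
    """Stack the query with its retrieved neighbours: shape (k + 1, dim)."""
    top, _ = retrieve_top_k(query, bank, k)
    return np.vstack([query[None, :], bank[top]])

if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    bank = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
    query = np.array([1.0, 0.1])
    idx, sims = retrieve_top_k(query, bank, k=2)
    print(idx, sims)
    print(augment(query, bank, k=2).shape)
```

In this sketch a downstream detector would receive the stacked array, so its decision can draw on the retrieved neighbours as well as the query itself.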
