Abstract
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALMs directly utilize neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source large-scale dataset covering two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original Sharpness-Aware Minimization (SAM), we propose the CSAM strategy to learn a domain-balanced and generalized minimum. Experimental results demonstrate that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
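The CSAM strategy mentioned above builds on Sharpness-Aware Minimization (SAM), which first ascends to a nearby high-loss point and then takes the descent step from there; the "domain ascent bias" refers to one training domain dominating that ascent direction. The paper's exact CSAM procedure is not given in this abstract, so the following is only a minimal sketch of a plain SAM update on a toy quadratic loss (the loss, gradient, and function names are illustrative, not from the paper):

```python
import numpy as np

# Toy loss L(w) = 0.5 * ||w||^2 with analytic gradient, purely for illustration.
def loss(w):
    return 0.5 * np.sum(w ** 2)

def grad(w):
    return w

def sam_step(w, lr=0.1, rho=0.05):
    """One SAM update: ascend to a nearby high-loss point, then descend.

    rho controls the ascent radius; in CSAM this ascent is what the paper
    reportedly rebalances across domains (details not in the abstract).
    """
    g = grad(w)
    # Ascent direction: normalized gradient scaled to the rho-ball boundary.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent uses the gradient evaluated at the perturbed (sharper) point.
    g_sam = grad(w + eps)
    return w - lr * g_sam
```

For w = [1, 0] this perturbs to [1.05, 0] before descending, so a single step moves slightly further than plain gradient descent would, biasing the solution toward flatter minima.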
Abstract (translated)
With the proliferation of Audio Language Model (ALM) based deepfake audio, effective detection methods are essential. Unlike traditional deepfake audio generation, which often involves multi-step processes ending with a vocoder, ALMs use neural codecs to decode discrete codes directly into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of ALM-based audio generation: the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source large-scale dataset covering two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. In addition, to achieve universal detection of deepfake audio and address the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minimum. Experimental results show that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy achieves the lowest average Equal Error Rate (EER) of 0.616% across all test conditions.
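Both abstracts report performance as Equal Error Rate (EER): the operating point where the false-acceptance rate on fake audio equals the false-rejection rate on real audio. A minimal sketch of computing EER from detector scores (assuming the convention that a higher score means "real", label 1; the function name and convention are illustrative, not from the paper):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: find the threshold where FAR is closest to FRR.

    scores: detector outputs, higher = more likely real (assumption).
    labels: 1 for real audio, 0 for fake audio.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_eer, best_gap = 1.0, np.inf
    # Sweep every distinct score as a candidate decision threshold.
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # fake accepted as real
        frr = np.mean(scores[labels == 1] < t)   # real rejected as fake
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2
    return best_eer
```

With this sweep, a perfectly separating detector yields an EER of 0, and random scoring yields roughly 0.5; the 0.616% figure above corresponds to near-perfect separation across the test conditions.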
URL
https://arxiv.org/abs/2405.04880