Abstract
Recent advances in audio deepfake detection have leveraged graph neural networks (GNNs) to model frequency and temporal interdependencies in audio data, effectively identifying deepfake artifacts. However, GNN-based methods rely on substantial labeled data for graph construction and robust performance, which limits their applicability in scenarios where labeled data is scarce. Although vast amounts of audio data exist, labeling samples as genuine or fake remains labor-intensive and costly. To address this challenge, we propose SIGNL (Spatio-temporal vIsion Graph Non-contrastive Learning), a novel framework that maintains high GNN performance in low-label settings. SIGNL constructs spatio-temporal graphs by representing patches from the audio's visual spectrogram as nodes. These graph structures are modeled using vision graph convolutional (GC) encoders pre-trained through graph non-contrastive learning, a label-free approach that maximizes the similarity between positive pairs. The pre-trained encoders are then fine-tuned for audio deepfake detection, reducing reliance on labeled data. Experiments demonstrate that SIGNL outperforms state-of-the-art baselines across multiple audio deepfake detection datasets, achieving the lowest Equal Error Rate (EER) with as little as 5% labeled data. Additionally, SIGNL exhibits strong cross-domain generalization, achieving the lowest EER in evaluations involving diverse attack types and languages on the In-The-Wild dataset.
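To make the described pipeline concrete, below is a minimal PyTorch sketch of the two ideas in the abstract: turning spectrogram patches into graph nodes, and pre-training an encoder with a non-contrastive (BYOL/BGRL-style) objective that maximizes similarity between positive pairs. The patch size, the k-NN graph construction, the mean-aggregation convolution (a stand-in for the paper's vision GC encoder), and the augmentations are all illustrative assumptions, not SIGNL's actual design.

import torch
import torch.nn as nn
import torch.nn.functional as F

def spectrogram_to_patch_nodes(spec: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (freq, time) spectrogram into flattened patches, one node each."""
    f, t = spec.shape
    spec = spec[: f - f % patch, : t - t % patch]          # crop to a multiple of patch
    patches = spec.unfold(0, patch, patch).unfold(1, patch, patch)
    return patches.reshape(-1, patch * patch)              # (num_nodes, patch * patch)

def knn_adjacency(x: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Link each node to its k nearest neighbours in feature space (assumed scheme)."""
    dist = torch.cdist(x, x)
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self-match at index 0
    adj = torch.zeros_like(dist)
    adj.scatter_(1, idx, 1.0)
    return adj

class SimpleGraphConv(nn.Module):
    """One mean-aggregation graph convolution; a stand-in for a vision GC encoder."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return F.relu(self.lin(adj @ x / deg + x))         # aggregate neighbours + self

def non_contrastive_loss(z_online: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """Maximize cosine similarity between embeddings of the same nodes under two
    views; no negative pairs. Stop-gradient on the target branch (in practice a
    predictor head or momentum target encoder is also used to prevent collapse)."""
    z1 = F.normalize(z_online, dim=-1)
    z2 = F.normalize(z_target.detach(), dim=-1)
    return (2.0 - 2.0 * (z1 * z2).sum(dim=-1)).mean()

# Toy pre-training step: two augmented views of one spectrogram form positive pairs.
spec = torch.rand(128, 400)                                # (freq bins, time frames)
view_a = spec + 0.05 * torch.randn_like(spec)              # illustrative augmentations
view_b = spec * torch.empty_like(spec).uniform_(0.9, 1.1)

encoder = SimpleGraphConv(16 * 16, 64)
nodes_a = spectrogram_to_patch_nodes(view_a)
nodes_b = spectrogram_to_patch_nodes(view_b)
z_a = encoder(nodes_a, knn_adjacency(nodes_a))
z_b = encoder(nodes_b, knn_adjacency(nodes_b))
loss = non_contrastive_loss(z_a, z_b)
loss.backward()  # label-free update; encoder is later fine-tuned on labeled audio

Per the abstract, the pre-trained encoder would then be fine-tuned with a classification head on the (small) labeled set to distinguish genuine from fake audio.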
URL
https://arxiv.org/abs/2501.04942