
Vision Graph Non-Contrastive Learning for Audio Deepfake Detection with Limited Labels

2025-01-09 03:18:27
Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia

Abstract

Recent advancements in audio deepfake detection have leveraged graph neural networks (GNNs) to model frequency and temporal interdependencies in audio data, effectively identifying deepfake artifacts. However, the reliance of GNN-based methods on substantial labeled data for graph construction and robust performance limits their applicability in scenarios with limited labeled data. Although vast amounts of audio data exist, the process of labeling samples as genuine or fake remains labor-intensive and costly. To address this challenge, we propose SIGNL (Spatio-temporal vIsion Graph Non-contrastive Learning), a novel framework that maintains high GNN performance in low-label settings. SIGNL constructs spatio-temporal graphs by representing patches from the audio's visual spectrogram as nodes. These graph structures are modeled using vision graph convolutional (GC) encoders pre-trained through graph non-contrastive learning, a label-free approach that maximizes the similarity between positive pairs. The pre-trained encoders are then fine-tuned for audio deepfake detection, reducing reliance on labeled data. Experiments demonstrate that SIGNL outperforms state-of-the-art baselines across multiple audio deepfake detection datasets, achieving the lowest Equal Error Rate (EER) with as little as 5% labeled data. Additionally, SIGNL exhibits strong cross-domain generalization, achieving the lowest EER in evaluations involving diverse attack types and languages in the In-The-Wild dataset.
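
The two ideas named in the abstract, spectrogram patches as graph nodes and a non-contrastive objective over positive pairs, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how edges are chosen or which loss is used, so the k-nearest-neighbor edges (in the style of vision GNNs) and the negative cosine similarity loss (in the style of BYOL/SimSiam) are assumptions, and the function names `extract_patches`, `knn_edges`, and `negative_cosine` are hypothetical.

```python
import math


def extract_patches(spec, ph, pw):
    """Split a 2D spectrogram (rows = frequency, columns = time) into
    non-overlapping ph x pw patches, each flattened to a feature vector.
    Each patch becomes one graph node."""
    n_rows, n_cols = len(spec), len(spec[0])
    patches = []
    for r in range(0, n_rows - ph + 1, ph):
        for c in range(0, n_cols - pw + 1, pw):
            patches.append(
                [spec[r + i][c + j] for i in range(ph) for j in range(pw)]
            )
    return patches


def knn_edges(nodes, k):
    """Assumed edge rule: connect each node to its k nearest neighbors by
    Euclidean distance. Returns directed (source, target) index pairs."""
    edges = []
    for i, a in enumerate(nodes):
        dists = sorted(
            (math.dist(a, b), j) for j, b in enumerate(nodes) if j != i
        )
        edges.extend((i, j) for _, j in dists[:k])
    return edges


def negative_cosine(u, v):
    """Assumed non-contrastive loss for one positive pair: minimizing it
    maximizes the cosine similarity of the two embeddings. No negative
    samples are involved. Both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return -dot / norm


# Toy 4x4 "spectrogram" standing in for a real one.
spec = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = extract_patches(spec, 2, 2)   # four 2x2 patches, four nodes
edges = knn_edges(patches, 2)           # two outgoing edges per node
print(len(patches), len(edges))
```

In a real pipeline the two inputs to `negative_cosine` would be encoder outputs for two augmented views of the same graph, and the encoder would be a graph convolutional network rather than raw patch values; the sketch only shows the shape of the computation.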

Abstract (translated)

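Recent progress in audio deepfake detection has used graph neural networks (GNNs) to model frequency and temporal dependencies in audio data, effectively identifying deepfake artifacts. However, GNN-based methods need large amounts of labeled data both for graph construction and for robust performance, which limits their use where labeled data is scarce. Although vast amounts of audio data exist, manually labeling samples as genuine or fake is labor-intensive and costly. To address this challenge, we propose a new framework, SIGNL (Spatio-temporal vIsion Graph Non-contrastive Learning), which maintains the high performance of GNN methods in low-label settings. SIGNL constructs spatio-temporal graphs by representing patches of the audio's visual spectrogram as nodes, and models these graph structures with vision graph convolutional (GC) encoders pre-trained through non-contrastive learning, a label-free method that aims to maximize the similarity between positive samples. The pre-trained encoders are then fine-tuned for the audio deepfake detection task, reducing the need for labeled data. Experiments show that SIGNL outperforms state-of-the-art baselines on multiple audio deepfake datasets and achieves the lowest Equal Error Rate (EER) even with only 5% labeled data. SIGNL also shows strong cross-domain generalization, achieving the lowest EER on the In-The-Wild dataset, which involves diverse attack types and languages.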

URL

https://arxiv.org/abs/2501.04942

PDF

https://arxiv.org/pdf/2501.04942.pdf

