CLAD: Robust Audio Deepfake Detection Against Manipulation Attacks with Contrastive Learning

2024-04-24 13:10:35
Haolin Wu, Jing Chen, Ruiying Du, Cong Wu, Kun He, Xingcan Shang, Hao Ren, Guowen Xu

Abstract

The increasing prevalence of audio deepfakes poses significant security threats and calls for robust detection methods. Existing detection systems show promise, but their robustness against malicious audio manipulations remains underexplored. To bridge this gap, we undertake the first comprehensive study of how susceptible the most widely adopted audio deepfake detectors are to manipulation attacks. Surprisingly, even manipulations such as volume control can significantly bypass detection without affecting human perception. To address this, we propose CLAD (Contrastive Learning-based Audio deepfake Detector) to enhance robustness against manipulation attacks. The key idea is to use contrastive learning to minimize the variations introduced by manipulations, thereby making detection more robust. Additionally, we incorporate a length loss, which aims to improve detection accuracy by clustering real audio samples more closely in the feature space. We comprehensively evaluated the most widely adopted audio deepfake detection models and our proposed CLAD against various manipulation attacks. The existing detection models proved vulnerable, with the false acceptance rate (FAR) rising to 36.69%, 31.23%, and 51.28% under volume control, fading, and noise injection, respectively. CLAD enhanced robustness, reducing the FAR to 0.81% under noise injection and consistently maintaining an FAR below 1.63% across all tests. Our source code and documentation are available in the artifact repository.
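
As a rough illustration of the three manipulations named in the abstract and of the reported metric, the following is a minimal sketch and not the authors' implementation. The function names, the linear fade shape, the Gaussian noise model, and the SNR parameter are all assumptions made for this example. It also assumes that FAR means the share of fake clips a detector accepts as real, and that a higher score means "more likely real".

```python
import math
import random


def volume_control(wave, gain):
    """Scale every sample by a constant gain factor (hypothetical helper)."""
    return [gain * s for s in wave]


def fade_in(wave, fade_len):
    """Linearly ramp the first fade_len samples from 0 towards full amplitude.

    A linear ramp is an assumption; other fade shapes are possible.
    """
    out = list(wave)
    for i in range(min(fade_len, len(out))):
        out[i] *= i / fade_len
    return out


def inject_noise(wave, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB.

    The noise model and the SNR parameterisation are assumptions.
    """
    signal_power = sum(s * s for s in wave) / len(wave)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in wave]


def false_acceptance_rate(scores, labels, threshold):
    """Fraction of fake clips (label 0) scored at or above the threshold.

    Assumes a higher score means "more likely real", so a fake clip at or
    above the threshold is wrongly accepted.
    """
    fake_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not fake_scores:
        raise ValueError("need at least one fake sample")
    accepted = sum(1 for s in fake_scores if s >= threshold)
    return accepted / len(fake_scores)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 440 Hz tone sampled at 16 kHz, standing in for a real recording.
    tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
    quieter = volume_control(tone, 0.5)
    faded = fade_in(tone, 400)
    noisy = inject_noise(tone, 20.0, rng)
    print(len(quieter), len(faded), len(noisy))
```

In an evaluation along these lines, one would score the manipulated fake clips with a detector and compare `false_acceptance_rate` before and after manipulation; the detector itself is outside the scope of this sketch.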

URL

https://arxiv.org/abs/2404.15854

PDF

https://arxiv.org/pdf/2404.15854.pdf

