Paper Reading AI Learner

Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

2024-05-03 15:27:11
Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva

Abstract

Generalization is a major issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which ever more accurate synthesis methods are developed, it is crucial to design techniques that also work well on data they were not trained on. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework, and fake audio is exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech samples are needed for training, cutting off any link with the generation method at the root and ensuring full generalization ability. Features are extracted with large, general-purpose pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widely used in the community show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and largely outperforming them on out-of-distribution data.
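
The abstract describes the approach only at a high level (a pre-trained feature extractor, plus a comparison of the test voice against reference fragments of the claimed identity), so the snippet below is just a minimal sketch of that verification-style scoring, not the authors' pipeline. It assumes a generic WavLM backbone from HuggingFace, mean-pooled frame features as the speaker embedding, cosine similarity against the closest reference fragment, and an arbitrary threshold; all of these choices are placeholders.

```python
# Sketch of verification-based fake-voice detection (assumed design, not the paper's
# exact method): embed the test utterance and a few reference recordings of the claimed
# identity with a pre-trained model, then flag the test utterance as fake if it is too
# dissimilar to every reference.
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoFeatureExtractor, WavLMModel

MODEL_NAME = "microsoft/wavlm-base-plus"  # placeholder backbone choice
extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)
backbone = WavLMModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def embed(waveform: np.ndarray) -> torch.Tensor:
    """Utterance-level embedding: mean-pool WavLM frame features (1-D float array, 16 kHz)."""
    inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    frames = backbone(**inputs).last_hidden_state        # (1, T, D)
    return F.normalize(frames.mean(dim=1), dim=-1)       # (1, D), unit-normalized

def is_genuine(test_wave: np.ndarray, reference_waves: list[np.ndarray],
               threshold: float = 0.85) -> bool:
    """True if the test voice matches the claimed identity; the threshold is a placeholder
    that would be calibrated on real speech of enrolled speakers (no fake data needed)."""
    test_emb = embed(test_wave)                                        # (1, D)
    ref_embs = torch.cat([embed(w) for w in reference_waves], dim=0)   # (N, D)
    score = (ref_embs @ test_emb.T).max().item()   # cosine similarity to closest reference
    return score >= threshold
```

Because the decision only compares the test voice with genuine reference fragments, nothing in this setup depends on how fake speech is generated, which is the property the abstract credits for the method's generalization ability.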


URL

https://arxiv.org/abs/2405.02179

PDF

https://arxiv.org/pdf/2405.02179.pdf

