Paper Reading AI Learner

SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection

2025-08-13 16:09:36
Yachao Liang, Min Yu, Gang Li, Jianguo Jiang, Boquan Li, Feng Yu, Ning Zhang, Xiang Meng, Weiqing Huang

Abstract

Detection of face forgery videos remains a formidable challenge in digital forensics, particularly generalization to unseen datasets and robustness to common perturbations. In this paper, we tackle this issue by exploiting the synergy between audio and visual speech through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, provide precise information that effectively reflects facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task that encodes both local and global semantic information simultaneously. The derived model is then transferred directly to the forgery detection task. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in cross-dataset generalization and robustness, without any fake videos participating in model training. Code is available at this https URL.
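As a rough illustration of the transfer step, audio-visual agreement can be scored by comparing per-frame speech embeddings from the two modalities and flagging videos where they disagree. The sketch below is a minimal illustration, not the paper's actual pipeline: the `audio_emb`/`visual_emb` arrays stand in for hypothetical pre-computed encoder outputs, and the `threshold` value is illustrative.

```python
import numpy as np

def speech_similarity(audio_emb: np.ndarray, visual_emb: np.ndarray) -> float:
    """Mean per-frame cosine similarity between audio and visual
    speech embeddings, each of shape (num_frames, dim)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * v, axis=1)))

def is_forged(audio_emb: np.ndarray, visual_emb: np.ndarray,
              threshold: float = 0.5) -> bool:
    # Low audio-visual agreement suggests the mouth movements
    # do not match the spoken content (illustrative threshold).
    return speech_similarity(audio_emb, visual_emb) < threshold
```

Because only real videos are needed to learn the shared representation, no fake samples enter training; the forgery decision reduces to a cross-modal consistency check at test time.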

URL

https://arxiv.org/abs/2508.09913

PDF

https://arxiv.org/pdf/2508.09913.pdf

