Paper Reading AI Learner

The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection

2025-12-12 02:49:18
Yupei Li, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang, Bj\"orn W. Schuller

Abstract

Speech deepfake detection has been widely explored using low-level acoustic descriptors. However, each study tends to select a different feature set, making it difficult to establish a unified representation for the task. Moreover, such features are not intuitive for humans to perceive, as the distinction between bona fide and synthesized speech becomes increasingly subtle with the advancement of deepfake generation techniques. Emotion, on the other hand, remains a uniquely human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features have implicit correlations with emotion. For instance, speech features recognized by automatic speech recognition systems often vary naturally with emotional expression. Based on this insight, we propose a novel training framework that leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets demonstrate consistent and substantial improvements: accuracy gains of up to approximately 6% and 2%, respectively, and equal error rate (EER) reductions of up to about 4% and 1%, respectively, while achieving comparable results on ASVspoof2019. This approach provides a unified training strategy for all features and an interpretable feature direction for deepfake detection, while improving model performance through emotion-informed learning.
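The abstract reports results in terms of the equal error rate (EER), the operating point where the false acceptance rate (spoof scored as bona fide) equals the false rejection rate (bona fide scored as spoof). The paper does not include code, so the sketch below is only an illustrative, naive threshold-sweep implementation of this standard metric; the score convention (higher score = more likely bona fide) is an assumption.

```python
def eer(scores, labels):
    """Naive equal-error-rate estimate by sweeping every score as a threshold.

    scores: detector outputs, higher = more likely bona fide (assumed convention)
    labels: 1 = bona fide, 0 = spoof
    Returns the (FAR + FRR) / 2 at the threshold where |FAR - FRR| is smallest.
    """
    n_spoof = labels.count(0)
    n_bona = labels.count(1)
    best_gap, best_eer = float("inf"), None
    for t in sorted(set(scores)):
        # FAR: spoof utterances accepted as bona fide at this threshold
        far = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t) / n_spoof
        # FRR: bona fide utterances rejected at this threshold
        frr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t) / n_bona
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

A perfectly separating detector yields an EER of 0; a random one hovers around 0.5. Production evaluations typically interpolate the DET curve rather than sweeping raw scores, but the fixed-point definition is the same.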

Abstract (translated)

Speech deepfake detection has been widely studied using low-level acoustic descriptors. However, individual studies tend to select different feature sets, making it difficult to establish a unified representation for the task. Moreover, these features are not intuitively perceivable by humans, since the distinction between genuine and synthesized speech becomes increasingly hard to discern as deepfake generation techniques advance. By contrast, emotion is a uniquely human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features are implicitly correlated with emotion. For example, speech features recognized by automatic speech recognition systems typically vary naturally with emotional expression. Based on this insight, we propose a novel training framework that uses emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets show accuracy improvements of approximately 6% and 2%, respectively, and equal error rate (EER) reductions of up to 4% and 1%, while performing comparably to existing methods on ASVspoof2019. This approach provides a unified training strategy for all features and an interpretable, emotion-oriented feature direction, improving model performance through emotion-guided learning.

URL

https://arxiv.org/abs/2512.11241

PDF

https://arxiv.org/pdf/2512.11241.pdf

