Abstract
Speech deepfake detection has been widely explored using low-level acoustic descriptors. However, each study tends to select a different feature set, making it difficult to establish a unified representation for the task. Moreover, such features are not intuitive for humans to perceive, as the distinction between bona fide and synthesized speech becomes increasingly subtle with the advancement of deepfake generation techniques. Emotion, on the other hand, remains a uniquely human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features correlate implicitly with emotion. For instance, the speech features extracted by automatic speech recognition systems often vary naturally with emotional expression. Based on this insight, we propose a novel training framework that leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets demonstrate consistent and substantial gains: accuracy improves by up to approximately 6% and 2%, respectively, and equal error rate (EER) drops by up to about 4% and 1%, respectively, while results on ASVspoof2019 remain comparable. This approach provides a unified training strategy for all features and interpretable feature directions for deepfake detection, improving model performance through emotion-informed learning.
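The abstract does not detail the bridging mechanism, so the following is only a minimal sketch of one plausible reading: a shared encoder over conventional acoustic features is trained jointly on spoof detection and an auxiliary emotion task, so that emotion supervision shapes the learned representation. All names, dimensions, and the loss weight here are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed reading, not the paper's actual method): emotion as
# an auxiliary supervision signal alongside bona fide/fake classification.
import torch
import torch.nn as nn

class EmotionBridgeDetector(nn.Module):
    """Shared encoder with two heads: spoof detection and emotion recognition.
    Hypothetical architecture; all dimensions are placeholders."""
    def __init__(self, feat_dim=80, hidden=128, n_emotions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.spoof_head = nn.Linear(hidden, 2)              # bona fide vs. fake
        self.emotion_head = nn.Linear(hidden, n_emotions)   # auxiliary emotion task

    def forward(self, x):
        h = self.encoder(x)                                 # shared representation
        return self.spoof_head(h), self.emotion_head(h)

model = EmotionBridgeDetector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

# Toy batch: 16 utterance-level feature vectors with spoof and emotion labels.
x = torch.randn(16, 80)
y_spoof = torch.randint(0, 2, (16,))
y_emotion = torch.randint(0, 4, (16,))

logits_spoof, logits_emotion = model(x)
# The 0.5 auxiliary weight is a guess; the paper's objective may differ.
loss = ce(logits_spoof, y_spoof) + 0.5 * ce(logits_emotion, y_emotion)
opt.zero_grad()
loss.backward()
opt.step()
```

The key design idea this sketch captures is that both heads backpropagate into the same encoder, so emotion labels act as a regularizer on the deepfake features rather than a separate model.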
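The reported metric, equal error rate (EER), is the operating point at which the false acceptance rate equals the false rejection rate; lower is better. A small self-contained helper (a simple threshold sweep, not the paper's evaluation code) illustrates the computation:

```python
import numpy as np

def compute_eer(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.
    scores: higher means more likely bona fide; labels: 1 bona fide, 0 spoof."""
    bona, spoof = scores[labels == 1], scores[labels == 0]
    best = (float("inf"), 1.0)  # (|FAR - FRR|, mean error) at best threshold
    for t in np.unique(scores):
        far = float(np.mean(spoof >= t))  # spoof accepted as bona fide
        frr = float(np.mean(bona < t))    # bona fide rejected as spoof
        best = min(best, (abs(far - frr), (far + frr) / 2))
    return best[1]

# Synthetic demo: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 200)])
labels = np.concatenate([np.ones(200, dtype=int), np.zeros(200, dtype=int)])
print(f"EER: {compute_eer(scores, labels):.3f}")
```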
URL
https://arxiv.org/abs/2512.11241