Abstract
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: this https URL.
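The splice-point construction described above can be illustrated with a minimal sketch. The snippet below is not the authors' pipeline; it assumes word-boundary timestamps have already been obtained from a forced aligner (e.g. a Montreal Forced Aligner-style tool) and shows how a synthetic segment might be spliced into genuine audio at an aligned word boundary, with a short crossfade to suppress click artifacts. The function name and parameters are hypothetical.

```python
import numpy as np

def splice_at_boundary(real, fake, boundary_real, boundary_fake,
                       sr=16000, fade_ms=10):
    """Replace the tail of `real` (from an aligned word boundary) with the
    tail of `fake` (from its own aligned word boundary), crossfading over
    a few milliseconds so the splice point has no abrupt discontinuity.

    boundary_real / boundary_fake: word-boundary times in seconds, as
    produced by a forced aligner (assumed, not part of this sketch).
    """
    i = int(boundary_real * sr)   # splice sample in the genuine audio
    j = int(boundary_fake * sr)   # splice sample in the synthetic audio
    n = int(sr * fade_ms / 1000)  # crossfade length in samples
    head, tail = real[:i], fake[j:]
    # linear crossfade spanning the boundary: head fades out, tail fades in
    ramp = np.linspace(0.0, 1.0, n)
    overlap = head[-n:] * (1.0 - ramp) + tail[:n] * ramp
    return np.concatenate([head[:-n], overlap, tail[n:]])

# Toy usage: 1 s of "genuine" audio spliced with "synthetic" audio
real = np.ones(16000)   # stand-in for a real waveform
fake = np.zeros(16000)  # stand-in for a synthesized waveform
mixed = splice_at_boundary(real, fake, boundary_real=0.5, boundary_fake=0.25)
```

A real pipeline would additionally match loudness and prosody across the boundary and, as the abstract notes, overlay background effects; this sketch only shows the boundary-smoothing idea.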
URL
https://arxiv.org/abs/2512.13012