Abstract
Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce the Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. The model integrates two distinct components within each layer: Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM). RG-MHA uses facial landmarks to create regional attention masks, guiding the model to scrutinize structural inconsistencies across different facial areas. Crucially, the separate LAMM module dynamically generates layer-specific parameters, including mask weights and gating values, based on network context. These parameters then modulate the behavior of RG-MHA, enabling adaptive adjustment of regional focus across network depths. This architecture facilitates the capture of subtle, hierarchical forgery cues that are common to diverse generation techniques, such as GANs and Diffusion Models. In cross-model generalization tests, LAMM-ViT achieves 94.09% mean ACC (a +5.45% improvement over SoTA) and 98.62% mean AP (a +3.09% improvement). These results demonstrate LAMM-ViT's ability to generalize and its potential for reliable deployment against evolving synthetic media threats.
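The interaction between the two components can be illustrated with a toy sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names, the use of the mean token as "network context", the additive per-region bias on attention logits, and the gate that blends region-guided and plain attention are all hypothetical and are not taken from the paper. The sketch uses a single head with Q = K = V for brevity and only the Python standard library.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lamm_params(tokens, region_proj, gate_proj):
    """Hypothetical LAMM step: derive per-region mask weights and a
    scalar gate from the mean token (our stand-in for network context)."""
    dim = len(tokens[0])
    context = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    mask_weights = softmax([dot(row, context) for row in region_proj])
    gate = sigmoid(dot(gate_proj, context))
    return mask_weights, gate

def region_guided_attention(tokens, patch_region, mask_weights, gate):
    """Hypothetical RG-MHA step (single head, Q = K = V = tokens).
    Each key's logit is biased by the weight of the facial region the key
    patch belongs to; the gate blends guided and plain attention."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in tokens:
        logits = [dot(q, k) * scale for k in tokens]
        biased = [l + mask_weights[patch_region[j]] for j, l in enumerate(logits)]
        plain = softmax(logits)
        guided = softmax(biased)
        attn = [gate * g + (1.0 - gate) * p for g, p in zip(guided, plain)]
        outputs.append([sum(a * v[d] for a, v in zip(attn, tokens))
                        for d in range(dim)])
    return outputs

# Toy input: 4 patch tokens of dimension 2. Region ids are assumed to come
# from a facial landmark detector (0 = eyes, 1 = nose, 2 = mouth).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
patch_region = [0, 0, 1, 2]
region_proj = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]  # stand-in for learned weights
gate_proj = [0.5, 0.5]                                # stand-in for learned weights

mask_weights, gate = lamm_params(tokens, region_proj, gate_proj)
out = region_guided_attention(tokens, patch_region, mask_weights, gate)
```

In a layered model, each layer would hold its own `region_proj` and `gate_proj`, so the regional emphasis could differ with depth. With the gate at 0 the sketch reduces to ordinary attention, which makes the effect of the regional bias easy to isolate.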
URL
https://arxiv.org/abs/2505.07734