LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

2025-05-12 16:42:19
Jiangling Zhang, Weijie Zhu, Jirui Huang, Yaxiong Chen

Abstract

Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. This model integrates distinct Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer. RG-MHA utilizes facial landmarks to create regional attention masks, guiding the model to scrutinize architectural inconsistencies across different facial areas. Crucially, the separate LAMM module dynamically generates layer-specific parameters, including mask weights and gating values, based on network context. These parameters then modulate the behavior of RG-MHA, enabling adaptive adjustment of regional focus across network depths. This architecture facilitates the capture of subtle, hierarchical forgery cues ubiquitous among diverse generation techniques, such as GANs and Diffusion Models. In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC (a +5.45% improvement over SoTA) and 98.62% mean AP (a +3.09% improvement). These results demonstrate LAMM-ViT's exceptional ability to generalize and its potential for reliable deployment against evolving synthetic media threats.
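
The interaction between the two components described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how the regional masks enter the attention computation or how LAMM derives its parameters, so the sketch assumes a single attention head, an additive mask bias on the attention logits, a mean-pooled token context, and per-layer projection matrices. All function names, parameter names, and the toy region assignment (`region_masks`, `lamm_params`, `region_guided_attention`, `w_mask`, `w_gate`, `patch_regions`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_masks(patch_regions, num_regions):
    # masks[r, i, j] = 1 when key patch j lies in facial region r.
    # In practice the region of each patch would come from facial landmarks;
    # here it is simply given as an integer label per patch (assumption).
    n = patch_regions.shape[0]
    region_ids = np.arange(num_regions)[:, None, None]        # (R, 1, 1)
    member = patch_regions[None, None, :] == region_ids       # (R, 1, N)
    return np.broadcast_to(member, (num_regions, n, n)).astype(float)

def lamm_params(tokens, w_mask, w_gate):
    # Assumed stand-in for the LAMM step: derive layer-specific mask weights
    # and a gate from the current layer's mean-pooled tokens.
    context = tokens.mean(axis=0)                              # (D,)
    mask_weights = softmax(context @ w_mask)                   # (R,), sums to 1
    gate = 1.0 / (1.0 + np.exp(-(context @ w_gate)))           # scalar in (0, 1)
    return mask_weights, float(gate)

def region_guided_attention(tokens, masks, mask_weights, gate, w_q, w_k, w_v):
    # Single-head attention whose logits are biased toward weighted regions.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])                    # (N, N)
    bias = gate * np.tensordot(mask_weights, masks, axes=1)    # (N, N)
    attn = softmax(scores + bias, axis=-1)
    return attn @ v, attn

# Toy example: 6 patches, 8-dim tokens, 3 facial regions.
rng = np.random.default_rng(0)
N, D, R = 6, 8, 3
tokens = rng.normal(size=(N, D))
patch_regions = np.array([0, 0, 1, 1, 2, 2])
masks = region_masks(patch_regions, R)

# One set of projection weights per layer would make the modulation layer-specific.
w_mask, w_gate = rng.normal(size=(D, R)), rng.normal(size=D)
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))

mask_weights, gate = lamm_params(tokens, w_mask, w_gate)
out, attn = region_guided_attention(tokens, masks, mask_weights, gate, w_q, w_k, w_v)
```

Under these assumptions, a gate near 0 reduces the layer to ordinary self-attention, while a gate near 1 shifts attention toward the regions that receive the largest mask weights at that depth.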

URL

https://arxiv.org/abs/2505.07734

PDF

https://arxiv.org/pdf/2505.07734.pdf

