Paper Reading AI Learner

SARE: Semantic-Aware Reconstruction Error for Generalizable Diffusion-Generated Image Detection

2025-08-13 04:37:36
Ju Yeon Kang, Jaehong Park, Semin Kim, Ji Won Yoon, Nam Soo Kim

Abstract

Recently, diffusion-generated image detection has gained increasing attention, as the rapid advancement of diffusion models has raised serious concerns about their potential misuse. While existing detection methods have achieved promising results, their performance often degrades significantly when facing fake images from unseen, out-of-distribution (OOD) generative models, since they primarily rely on model-specific artifacts. To address this limitation, we explore a fundamental property commonly observed in fake images. Motivated by the observation that fake images tend to exhibit higher similarity to their captions than real images, we propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. The hypothesis behind SARE is that real images, whose captions often fail to fully capture their complex visual content, may undergo noticeable semantic shifts during the caption-guided reconstruction process. In contrast, fake images, which closely align with their captions, show minimal semantic changes. By quantifying these semantic shifts, SARE can be utilized as a discriminative feature for robust detection across diverse generative models. We empirically demonstrate that the proposed method exhibits strong generalization, outperforming existing baselines on benchmarks including GenImage and CommunityForensics.
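The abstract leaves the exact metric unspecified, but the core idea — score an image by the semantic shift between it and its caption-guided reconstruction — can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes CLIP-style image embeddings have already been extracted (here stand-in vectors), uses `1 - cosine similarity` as the shift measure, and uses an arbitrary decision threshold.

```python
import numpy as np

def semantic_shift(orig_emb, recon_emb):
    """Hypothetical SARE-style score: 1 - cosine similarity between the
    semantic embeddings of an image and its caption-guided reconstruction.
    A larger value means a larger semantic shift."""
    a = np.asarray(orig_emb, dtype=float)
    b = np.asarray(recon_emb, dtype=float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

def classify(score, threshold=0.2):
    # Per the paper's hypothesis: real images drift semantically during
    # caption-guided reconstruction, fakes stay close to their captions.
    # The threshold is illustrative, not from the paper.
    return "real" if score > threshold else "fake"

# Toy embeddings standing in for extracted image features.
img = np.array([1.0, 0.0, 0.0])
recon_close = np.array([0.98, 0.1, 0.0])  # fake: reconstruction stays close
recon_far = np.array([0.5, 0.7, 0.5])     # real: reconstruction drifts

print(classify(semantic_shift(img, recon_close)))  # fake
print(classify(semantic_shift(img, recon_far)))    # real
```

In practice the embeddings would come from a vision-language encoder and the reconstruction from a caption-conditioned diffusion pipeline; only the scoring logic is shown here.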


URL

https://arxiv.org/abs/2508.09487

PDF

https://arxiv.org/pdf/2508.09487.pdf

