Leveraging Pre-Trained Visual Models for AI-Generated Video Detection

2025-07-17 15:36:39
Keerthi Veeramachaneni, Praveen Tirupattur, Amrit Singh Bedi, Mubarak Shah

Abstract

Recent advances in generative AI (GenAI) have significantly improved the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, detecting it becomes critical for combating misinformation, protecting privacy, and preventing security threats. Although detection of AI-generated images has progressed substantially, current video detection methods focus largely on deepfakes, which primarily involve human faces. Video generation has advanced beyond deepfakes, however, creating an urgent need for methods that can detect AI-generated videos with generic content. To address this gap, we propose a novel approach that leverages pre-trained visual models to distinguish real from generated videos. Because these models have been trained on extensive real visual content, the features extracted from them contain inherent signals that help separate real from generated videos. Using these extracted features, we achieve high detection performance without additional model training, and we further improve performance by training a simple linear classification layer on top of the extracted features. We validate our method on VID-AID, a dataset we compiled that includes around 10,000 AI-generated videos produced by 9 different text-to-video models, along with 4,000 real videos, totaling over 7 hours of video content. Our evaluation shows that the approach achieves detection accuracy above 90% on average. Upon acceptance, we plan to publicly release the code, the pre-trained models, and our dataset to support ongoing research in this area.
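
The "simple linear classification layer on top of the extracted features" mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the backbone or the training setup, so the features below are synthetic Gaussian vectors standing in for the output of a frozen pre-trained visual model, and the function names, feature dimension, learning rate, and class shift are all hypothetical choices made for illustration. The 400-to-1000 class ratio only loosely mirrors the 4,000 real and 10,000 generated videos described above.

```python
import numpy as np


def make_synthetic_features(n_real, n_fake, dim, shift, rng):
    """Hypothetical stand-in for features from a frozen pre-trained model.

    Real and generated samples are drawn from two Gaussians whose means
    differ by `shift` in every dimension. This is an assumption for the
    sketch, not a property reported in the paper.
    """
    real = rng.normal(0.0, 1.0, size=(n_real, dim))
    fake = rng.normal(shift, 1.0, size=(n_fake, dim))
    X = np.vstack([real, fake])
    y = np.concatenate([np.zeros(n_real), np.ones(n_fake)])  # 1 = generated
    return X, y


def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Fit one linear layer with a sigmoid output by batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        logits = X @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - y  # gradient of the mean cross-entropy w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict(X, w, b):
    """Return 1 (generated) where the logit is positive, else 0 (real)."""
    return (X @ w + b > 0).astype(int)


rng = np.random.default_rng(0)
X, y = make_synthetic_features(n_real=400, n_fake=1000, dim=64, shift=0.5, rng=rng)

# Shuffle, then hold out 20% of the samples for evaluation.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
split = int(0.8 * len(y))

w, b = train_linear_probe(X[:split], y[:split])
accuracy = (predict(X[split:], w, b) == y[split:]).mean()
print(f"held-out accuracy on synthetic features: {accuracy:.3f}")
```

In a real pipeline, `make_synthetic_features` would be replaced by per-video features extracted once from the frozen backbone, so only the weight vector and bias are trained. The accuracy printed here reflects the synthetic data only and says nothing about the results reported in the paper.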


URL

https://arxiv.org/abs/2507.13224

PDF

https://arxiv.org/pdf/2507.13224.pdf

