Abstract
Recent advances in Generative AI (GenAI) have led to significant improvements in the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, detecting generated content becomes critical for combating misinformation, ensuring privacy, and preventing security threats. Although there has been substantial progress in detecting AI-generated images, current methods for video detection focus largely on deepfakes, which primarily involve human faces. However, the field of video generation has advanced beyond deepfakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. To address this gap, we propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. The features extracted from these pre-trained models, which have been trained on extensive real visual content, contain inherent signals that help distinguish real from generated videos. Using these extracted features, we achieve high detection performance without any additional model training, and we further improve performance by training a simple linear classification layer on top of the extracted features. We validated our method on a dataset we compiled (VID-AID), which includes around 10,000 AI-generated videos produced by 9 different text-to-video models, along with 4,000 real videos, totaling over 7 hours of video content. Our evaluation shows that our approach achieves high detection accuracy, above 90% on average, underscoring its effectiveness. Upon acceptance, we plan to publicly release the code, the pre-trained models, and our dataset to support ongoing research in this critical area.
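The detection pipeline described in the abstract reduces to a linear probe: freeze a pre-trained visual encoder, extract per-video features, and fit a linear classifier on top. The sketch below illustrates that second stage only, under stated assumptions: the backbone, its feature dimensionality, and the synthetic feature arrays are all placeholders (the paper's actual encoder and VID-AID features are not reproduced here), so the shifted Gaussian blobs merely stand in for real vs. generated feature distributions.

```python
# Minimal sketch of a linear probe over frozen pre-trained features,
# as described in the abstract. Feature extraction is assumed to have
# happened upstream (e.g., a frozen image/video encoder, mean-pooled
# over sampled frames); synthetic features keep the snippet runnable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features: 512-dim vectors per video (dimensionality is
# an assumption, not taken from the paper). The generated-video blob
# is mean-shifted to mimic a detectable distributional artifact.
n_real, n_gen, dim = 400, 400, 512
real_feats = rng.normal(loc=0.0, scale=1.0, size=(n_real, dim))
gen_feats = rng.normal(loc=0.3, scale=1.0, size=(n_gen, dim))

X = np.vstack([real_feats, gen_feats])
y = np.concatenate([np.zeros(n_real), np.ones(n_gen)])  # 0 = real, 1 = generated

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The "simple linear classification layer": logistic regression on
# frozen features, with no fine-tuning of the backbone.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the backbone stays frozen, training cost is a single convex fit over cached features, which is what makes the linear-probe evaluation cheap to run across many text-to-video generators.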
URL
https://arxiv.org/abs/2507.13224