Abstract
Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
URL
https://arxiv.org/abs/2512.13665