Paper Reading AI Learner

Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency

2025-12-15 18:54:30
Wenhan Chen, Sezer Karaoglu, Theo Gevers

Abstract

Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
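The abstract's core signal can be illustrated concretely: in a static scene, parallel 3D lines project to image lines meeting at a vanishing point, and that point should stay stable across frames. The sketch below (an illustration of the general idea, not the paper's actual pipeline; all function names are hypothetical) estimates a vanishing point from line segments via homogeneous-coordinate intersections and scores temporal drift, which a detector could treat as a geometric-inconsistency cue.

```python
# Illustrative sketch of vanishing-point temporal consistency.
# NOT the Grab-3D implementation; names and thresholds are assumptions.

def cross(a, b):
    """Cross product of two homogeneous 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def vanishing_point(segments):
    """Estimate a vanishing point from segments assumed parallel in 3D.

    Intersects all pairs of lines and averages the finite intersections;
    a robust estimator (e.g. RANSAC) would be used in practice.
    """
    lines = [line_through(p, q) for p, q in segments]
    xs = ys = 0.0
    n = 0
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            x, y, w = cross(lines[i], lines[j])
            if abs(w) > 1e-9:  # skip near-parallel image lines
                xs += x / w
                ys += y / w
                n += 1
    return (xs / n, ys / n) if n else None

def temporal_drift(vps):
    """Mean frame-to-frame displacement of the vanishing point.

    In a static scene a real video should give near-zero drift;
    large drift hints at 3D geometric inconsistency.
    """
    d = [((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
         for a, b in zip(vps, vps[1:])]
    return sum(d) / len(d)
```

For example, two segments whose lines meet exactly at (100, 50) recover that point, and a sequence of identical per-frame vanishing points yields zero drift; a generated video with wandering geometry would score higher.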


URL

https://arxiv.org/abs/2512.13665

PDF

https://arxiv.org/pdf/2512.13665.pdf

