BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos

2025-06-25 03:30:04
Jiahao Lin, Weixuan Peng, Bojia Zi, Yifeng Gao, Xianbiao Qi, Xingjun Ma, Yu-Gang Jiang

Abstract

Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring, all of which undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial both for automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video-level or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high-quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: this https URL.

URL

https://arxiv.org/abs/2506.20103

PDF

https://arxiv.org/pdf/2506.20103.pdf

