Future Video Prediction from a Single Frame for Video Anomaly Detection

2023-08-15 14:04:50
Mohammad Baradaran, Robert Bergevin

Abstract

Video anomaly detection (VAD) is an important but challenging task in computer vision. The main challenge arises from the scarcity of training samples that cover all anomaly cases. Hence, semi-supervised anomaly detection methods have received more attention, since they focus on modeling normal behavior and detect anomalies by measuring deviations from normal patterns. Despite the impressive advances of these methods in modeling normal motion and appearance, long-term motion modeling has not been effectively explored so far. Inspired by the capabilities of the future frame prediction proxy task, we introduce the task of future video prediction from a single frame as a novel proxy task for video anomaly detection. This proxy task alleviates the difficulty previous methods have in learning longer motion patterns. Moreover, we replace the initial and future raw frames with their corresponding semantic segmentation maps, which not only makes the method aware of object classes but also makes the prediction task less complex for the model. Extensive experiments on the benchmark datasets (ShanghaiTech, UCSD-Ped1, and UCSD-Ped2) show the effectiveness of the method and the superiority of its performance compared to state-of-the-art prediction-based VAD methods.
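
The abstract does not say how deviations from the predicted future are turned into an anomaly score, so the following is only an illustrative sketch under stated assumptions. It assumes a model has already predicted K future semantic segmentation maps from a single frame, that each map is a grid of integer class ids, and that the fraction of mismatched pixels is used as the prediction error. The names `mismatch_ratio`, `clip_error`, and `anomaly_scores` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence

# Assumption: a segmentation map is an H x W grid of integer class ids.
LabelMap = Sequence[Sequence[int]]


def mismatch_ratio(predicted: LabelMap, observed: LabelMap) -> float:
    """Fraction of pixels whose predicted class differs from the observed class."""
    total = 0
    wrong = 0
    for pred_row, obs_row in zip(predicted, observed):
        for pred_cls, obs_cls in zip(pred_row, obs_row):
            total += 1
            if pred_cls != obs_cls:
                wrong += 1
    if total == 0:
        raise ValueError("label maps must not be empty")
    return wrong / total


def clip_error(predicted_clip: Sequence[LabelMap],
               observed_clip: Sequence[LabelMap]) -> float:
    """Mean per-frame mismatch over the K predicted future frames."""
    if not predicted_clip or len(predicted_clip) != len(observed_clip):
        raise ValueError("clips must be non-empty and of equal length")
    errors = [mismatch_ratio(p, o) for p, o in zip(predicted_clip, observed_clip)]
    return sum(errors) / len(errors)


def anomaly_scores(errors: Sequence[float]) -> List[float]:
    """Min-max normalize clip errors within one video (a common VAD convention,
    assumed here rather than taken from the paper)."""
    lo, hi = min(errors), max(errors)
    if hi == lo:
        return [0.0 for _ in errors]
    return [(e - lo) / (hi - lo) for e in errors]


# Toy usage: two 2x2 future frames predicted from a single starting frame.
predicted = [[[0, 0], [1, 1]], [[0, 0], [1, 1]]]
normal_observed = [[[0, 0], [1, 1]], [[0, 0], [1, 1]]]
odd_observed = [[[0, 2], [1, 1]], [[2, 2], [1, 1]]]  # unexpected class 2 appears

errors = [clip_error(predicted, normal_observed),
          clip_error(predicted, odd_observed)]
scores = anomaly_scores(errors)
```

In this toy case the clip containing the unexpected class receives the higher score. A real system would likely use a softer error (for example, one computed on class probabilities) and temporal smoothing of the scores; those choices are left open here.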

URL

https://arxiv.org/abs/2308.07783

PDF

https://arxiv.org/pdf/2308.07783.pdf

