Paper Reading AI Learner

Making Reconstruction-based Method Great Again for Video Anomaly Detection

2023-01-28 01:57:57
Yizhou Wang, Can Qin, Yue Bai, Yi Xu, Xu Ma, Yun Fu

Abstract

Anomaly detection in videos is a significant yet challenging problem. Previous approaches based on deep neural networks are either reconstruction-based or prediction-based. However, existing reconstruction-based methods 1) rely on old-fashioned convolutional autoencoders and are poor at modeling temporal dependencies; 2) are prone to overfitting the training samples, leading to indistinguishable reconstruction errors between normal and abnormal frames during the inference phase. To address these issues, we first draw inspiration from the transformer and propose the ${\textbf S}$patio-${\textbf T}$emporal ${\textbf A}$uto-${\textbf T}$rans-${\textbf E}$ncoder, dubbed $\textbf{STATE}$, a new autoencoder model for enhanced consecutive frame reconstruction. STATE is equipped with a specifically designed learnable convolutional attention module for efficient temporal learning and reasoning. Second, we put forward a novel reconstruction-based input perturbation technique applied during testing to further differentiate anomalous frames. Under the same perturbation magnitude, the test-time reconstruction error of normal frames decreases more than that of abnormal frames, which helps mitigate the overfitting problem of reconstruction. Since frame abnormality is highly related to the objects in the frame, we perform object-level reconstruction using both the raw frame and the corresponding optical flow patches. Finally, the anomaly score is computed from a combination of the raw and motion reconstruction errors on the perturbed inputs. Extensive experiments on benchmark video anomaly detection datasets demonstrate that our approach outperforms previous reconstruction-based methods by a notable margin and consistently achieves state-of-the-art anomaly detection performance. The code is available at this https URL.
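
To make the test-time perturbation and scoring idea concrete, here is a minimal PyTorch-style sketch of how an input patch could be nudged toward lower reconstruction error and how raw-frame and optical-flow errors could then be combined into an anomaly score. The autoencoder handles (`frame_ae`, `flow_ae`), the signed-gradient update, and the fusion weight `w_motion` are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def perturb_input(model, x, step_size=0.01, n_steps=1):
    """Nudge x in the direction that lowers its reconstruction error.

    Per the abstract, with the same perturbation magnitude normal frames
    lower their error more than abnormal ones. The signed-gradient update
    here is an assumed, illustrative choice.
    """
    x = x.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        recon_err = F.mse_loss(model(x), x)
        grad, = torch.autograd.grad(recon_err, x)
        x = (x - step_size * grad.sign()).detach()
    return x


def anomaly_score(frame_ae, flow_ae, frame_patch, flow_patch, w_motion=0.5):
    """Combine raw-frame and optical-flow reconstruction errors on perturbed inputs.

    frame_ae / flow_ae: autoencoders for appearance and motion patches (assumed
    interfaces); frame_patch / flow_patch: object-level patches for one frame.
    """
    frame_p = perturb_input(frame_ae, frame_patch)
    flow_p = perturb_input(flow_ae, flow_patch)
    with torch.no_grad():
        raw_err = F.mse_loss(frame_ae(frame_p), frame_p).item()
        motion_err = F.mse_loss(flow_ae(flow_p), flow_p).item()
    return (1.0 - w_motion) * raw_err + w_motion * motion_err
```

In this sketch a larger `w_motion` emphasizes the motion (optical flow) error over the appearance error; the actual fusion used in the paper may differ.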

Abstract (translated)

Anomaly detection in videos is an important yet challenging problem. Previous deep-learning-based approaches are either reconstruction-based or prediction-based. However, existing reconstruction-based methods 1) rely on old-fashioned convolutional autoencoders and are poor at modeling temporal dependencies; 2) tend to overfit the training samples, so that the reconstruction errors of normal and abnormal frames become indistinguishable during inference. To address these issues, we first draw inspiration from the Transformer and propose a new autoencoder model, the Spatio-Temporal Auto-Trans-Encoder (STATE), for enhanced consecutive frame reconstruction. STATE is equipped with a specifically designed learnable convolutional attention module for efficient temporal learning and reasoning. Second, we propose a novel reconstruction-based input perturbation technique applied at test time to further separate anomalous frames. Under the same perturbation magnitude, the test reconstruction error of normal frames decreases more than that of abnormal frames, which helps alleviate the overfitting problem of reconstruction. Since frame abnormality is highly correlated with the objects in the frame, we perform object-level reconstruction using both the raw frame and the corresponding optical flow patches. Finally, the anomaly score is designed based on the combination of the raw and motion reconstruction errors of the perturbed inputs. Extensive experiments on benchmark video anomaly detection datasets show that our method outperforms previous reconstruction-based methods and achieves state-of-the-art anomaly detection performance. The code is available at this https URL.

URL

https://arxiv.org/abs/2301.12048

PDF

https://arxiv.org/pdf/2301.12048.pdf

