Abstract
Anomaly detection in videos is a significant yet challenging problem. Previous deep-neural-network approaches are either reconstruction-based or prediction-based. However, existing reconstruction-based methods 1) rely on old-fashioned convolutional autoencoders and are poor at modeling temporal dependencies; 2) are prone to overfitting the training samples, leading to indistinguishable reconstruction errors for normal and abnormal frames at inference time. To address these issues, we first draw inspiration from the transformer and propose the ${\textbf S}$patio-${\textbf T}$emporal ${\textbf A}$uto-${\textbf T}$rans-${\textbf E}$ncoder, dubbed $\textbf{STATE}$, a new autoencoder model for enhanced consecutive-frame reconstruction. STATE is equipped with a specifically designed learnable convolutional attention module for efficient temporal learning and reasoning. Second, we put forward a novel reconstruction-based input perturbation technique applied at test time to further separate anomalous frames. Under the same perturbation magnitude, the reconstruction error of normal frames drops more than that of abnormal frames, which mitigates the overfitting problem of reconstruction. Because frame abnormality is highly related to the objects in the frame, we perform object-level reconstruction using both raw frame patches and the corresponding optical flow patches. Finally, the anomaly score combines the raw and motion reconstruction errors computed on the perturbed inputs. Extensive experiments on benchmark video anomaly detection datasets demonstrate that our approach outperforms previous reconstruction-based methods by a notable margin and consistently achieves state-of-the-art anomaly detection performance. The code is available at this https URL.
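The test-time input-perturbation idea can be sketched with a toy model. The sketch below is only an illustration of the general mechanism, not the paper's STATE network: a fixed linear projection stands in for a trained autoencoder, and the perturbation is assumed to be a single fixed-magnitude gradient step on the input that reduces reconstruction error. Under these assumptions, inputs close to the learned "normal" manifold lose a larger fraction of their reconstruction error than inputs far from it, which widens the relative gap used for scoring.

```python
import numpy as np

# Toy stand-in for a trained autoencoder: orthogonal projection onto a
# low-dimensional "normal" subspace. (Assumption: a real model would be
# the trained reconstruction network.)
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4))
Q, _ = np.linalg.qr(W)                    # orthonormal basis, shape (16, 4)

def reconstruct(x):
    # Project onto the subspace and back: the "reconstruction".
    return Q @ (Q.T @ x)

def recon_error(x):
    return float(np.sum((x - reconstruct(x)) ** 2))

def perturb(x, eps=0.1):
    # One fixed-magnitude step down the reconstruction-error gradient.
    # For e(x) = ||x - QQ^T x||^2 the gradient is 2 (I - QQ^T) x.
    g = 2.0 * (x - reconstruct(x))
    n = np.linalg.norm(g)
    return x if n == 0 else x - eps * g / n

# A "normal" input lies near the subspace; an "abnormal" one does not.
normal = Q @ rng.standard_normal(4) + 0.05 * rng.standard_normal(16)
abnormal = rng.standard_normal(16)

# Same perturbation magnitude, but the *relative* error drop is much
# larger for the normal input.
rel_drop_normal = 1.0 - recon_error(perturb(normal)) / recon_error(normal)
rel_drop_abnormal = 1.0 - recon_error(perturb(abnormal)) / recon_error(abnormal)
```

In this toy setting the relative drop admits a closed form, `1 - (||r|| - eps)^2 / ||r||^2` for residual norm `||r||`, so smaller residuals (normal inputs) shrink proportionally more; the paper's actual perturbation and scoring may differ in detail.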
URL
https://arxiv.org/abs/2301.12048