Abstract
Due to the ever-increasing availability of video surveillance cameras and the growing need for crime prevention, the violence detection task is attracting increasing attention from the research community. Compared with other action recognition tasks, violence detection in surveillance videos presents additional challenges, such as the wide variety of real fight scenes. Unfortunately, the available datasets are very small compared with other action recognition datasets. Moreover, in surveillance applications, the people in the scene differ from video to video, and the background of the footage differs from camera to camera. In addition, violent actions in real-life surveillance videos must be detected quickly to prevent unwanted consequences, so models benefit greatly from reduced memory usage and computational cost. These problems make classical action recognition methods difficult to adopt. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos. JOSENet improves on state-of-the-art self-supervised methods while requiring only one-fourth of the frames per video segment and a reduced frame rate. The source code and the instructions to reproduce our experiments are available at this https URL.
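The frame-budget claim above (one-fourth of the frames per segment at a reduced frame rate) can be illustrated with a minimal sampling sketch. The concrete numbers below (a 64-frame baseline segment, a 16-frame segment with stride 2, and the function name `sample_segment`) are illustrative assumptions, not values taken from the paper:

```python
def sample_segment(num_total, num_frames=16, stride=2, start=0):
    """Return the frame indices of one video segment.

    Sampling every `stride`-th frame reduces the effective frame rate;
    num_frames=16 is a quarter of a hypothetical 64-frame baseline.
    All values here are illustrative, not the paper's actual settings.
    """
    indices = [start + i * stride for i in range(num_frames)]
    if indices[-1] >= num_total:
        raise ValueError("segment exceeds video length")
    return indices

# A hypothetical 64-frame baseline segment vs. a reduced 16-frame segment:
baseline = sample_segment(200, num_frames=64, stride=1)
reduced = sample_segment(200, num_frames=16, stride=2)
print(len(baseline), len(reduced))  # 64 16
```

A shorter segment at a lower frame rate means fewer frames decoded, stored, and pushed through both the RGB and optical-flow streams, which is where the memory and compute savings come from.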
URL
https://arxiv.org/abs/2405.02961