Analysis of Real-Time Hostile Activity Detection from Spatiotemporal Features Using Time Distributed Deep CNNs, RNNs and Attention-Based Mechanisms

Abstract

Real-time video surveillance through CCTV camera systems has become essential for ensuring public safety, which is a priority today. Although CCTV cameras do much to increase security, these systems require constant human monitoring. To address this, intelligent surveillance systems can be built with deep learning video classification techniques that automate surveillance and detect violence as it happens. In this research, we explore such techniques for detecting violence in real time. Traditional image classification techniques fall short on video because they classify each frame separately, which causes the predictions to flicker. Therefore, many researchers are developing video classification techniques that consider spatiotemporal features. However, deploying deep learning models that rely on inputs such as skeleton points obtained through pose estimation or optical flow obtained through depth sensors is not always practical in an IoT environment: although these techniques yield higher accuracy, they are computationally heavier. Keeping these constraints in mind, we experimented with various video classification and action recognition techniques: ConvLSTM, LRCN (with both custom CNN layers and VGG-16 as the feature extractor), CNN-Transformer, and C3D. We achieved a test accuracy of 80% with ConvLSTM, 83.33% with CNN-BiLSTM, 70% with VGG16-BiLSTM, 76.76% with CNN-Transformer, and 80% with C3D.
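
The flicker problem mentioned in the abstract can be illustrated with a small sketch. This is not the authors' method: it uses made-up per-frame violence probabilities and a plain moving average (a much simpler stand-in for the learned temporal models the paper evaluates) only to show why looking at several frames together gives steadier predictions than thresholding each frame on its own. The function names, the window size, and the probability values are all assumptions chosen for illustration.

```python
from collections import deque


def count_label_flips(labels):
    """Count how often the label changes between consecutive frames."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def per_frame_labels(probs, threshold=0.5):
    """Classify every frame independently (the flicker-prone baseline)."""
    return [int(p >= threshold) for p in probs]


def smoothed_labels(probs, window=5, threshold=0.5):
    """Classify each frame from the mean of the last `window` probabilities.

    A moving average is only a stand-in for temporal context here; it is
    not one of the architectures evaluated in the paper.
    """
    recent = deque(maxlen=window)
    labels = []
    for p in probs:
        recent.append(p)
        labels.append(int(sum(recent) / len(recent) >= threshold))
    return labels


# Hypothetical per-frame "violence" probabilities from an image classifier.
frame_probs = [0.30, 0.55, 0.35, 0.60, 0.45, 0.70, 0.48, 0.75, 0.80, 0.85]

raw = per_frame_labels(frame_probs)
smooth = smoothed_labels(frame_probs)

print("per-frame:", raw, "flips:", count_label_flips(raw))
print("smoothed: ", smooth, "flips:", count_label_flips(smooth))
```

With these illustrative values, the independent per-frame labels change seven times across ten frames, while the smoothed labels change once. Models such as ConvLSTM or LRCN learn this kind of temporal dependence from data rather than relying on a fixed averaging window.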

URL

https://arxiv.org/abs/2302.11027

PDF

https://arxiv.org/pdf/2302.11027.pdf