Analysis of Real-Time Hostile Activity Detection from Spatiotemporal Features Using Time Distributed Deep CNNs, RNNs and Attention-Based Mechanisms

2023-02-21 22:02:39
Labib Ahmed Siddique, Rabita Junhai, Tanzim Reza, Salman Sayeed Khan, Tanvir Rahman

Abstract

Real-time video surveillance through CCTV camera systems has become essential for ensuring public safety, which is a priority today. Although CCTV cameras substantially improve security, these systems require constant human interaction and monitoring. To address this issue, intelligent surveillance systems can be built using deep learning video classification techniques that automate surveillance and detect violence as it happens. In this research, we explore deep learning video classification techniques to detect violence as it is happening. Traditional image classification techniques fall short when classifying videos because they classify each frame separately, which causes the predictions to flicker. Therefore, many researchers are developing video classification techniques that consider spatiotemporal features while classifying. However, deploying these deep learning models with methods such as skeleton points obtained through pose estimation and optical flow obtained through depth sensors is not always practical in an IoT environment. Although these techniques yield higher accuracy, they are computationally heavier. Keeping these constraints in mind, we experimented with various video classification and action recognition techniques such as ConvLSTM, LRCN (with both custom CNN layers and VGG-16 as feature extractor), CNN-Transformer, and C3D. We achieved a test accuracy of 80% on ConvLSTM, 83.33% on CNN-BiLSTM, 70% on VGG16-BiLSTM, 76.76% on CNN-Transformer, and 80% on C3D.
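
The flicker that the abstract attributes to frame-by-frame classification can be illustrated with a small sketch. This is not the authors' method (the paper uses learned spatiotemporal models such as ConvLSTM and C3D); it only assumes a hypothetical list of per-frame violence scores and compares independent thresholding against a simple moving average over recent frames. The function names, the window size, and the scores are all illustrative assumptions.

```python
from collections import deque


def per_frame_labels(scores, threshold=0.5):
    """Label each frame independently from its own score."""
    return [int(s >= threshold) for s in scores]


def windowed_labels(scores, window=5, threshold=0.5):
    """Label each frame from the mean score of the last `window` frames."""
    recent = deque(maxlen=window)  # keeps only the most recent scores
    labels = []
    for s in scores:
        recent.append(s)
        labels.append(int(sum(recent) / len(recent) >= threshold))
    return labels


def count_flips(labels):
    """Count label changes between consecutive frames (a proxy for flicker)."""
    return sum(a != b for a, b in zip(labels, labels[1:]))


# Hypothetical per-frame scores that oscillate around the 0.5 threshold.
scores = [0.7, 0.4, 0.8, 0.45, 0.75, 0.4, 0.7, 0.45, 0.8, 0.4]

frame_labels = per_frame_labels(scores)
smooth_labels = windowed_labels(scores, window=5)

print("per-frame flips:", count_flips(frame_labels))
print("windowed flips:", count_flips(smooth_labels))
```

On these made-up scores the independent labels change on every frame, while the windowed labels stay constant. A moving average is only a post-processing heuristic; the models listed in the abstract instead learn temporal context directly from frame sequences.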

URL

https://arxiv.org/abs/2302.11027

PDF

https://arxiv.org/pdf/2302.11027.pdf

