EventVAD: Training-Free Event-Aware Video Anomaly Detection

2025-04-17 16:59:04
Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, Shuyan Li

Abstract

Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require a large amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies, but they struggle to localize fine-grained visual transitions and diverse events. We therefore propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs (MLLMs) through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. It then performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. This statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it uses a hierarchical prompting strategy to guide MLLMs to reason before reaching a final decision. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) performance in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
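
The boundary-detection stage described in the abstract (a time-decayed graph over frames, noise filtering, then ratio-based thresholding) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the exponential decay, the single propagation step, the fixed smoothing kernel, the mean-based ratio test, and every function and parameter name below (`tau`, `radius`, `ratio`) are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def time_decay_graph(feats, tau=5.0):
    """Cosine-similarity adjacency damped by temporal distance (assumed form)."""
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(unit @ unit.T, 0.0, None)  # keep only non-negative edges
    idx = np.arange(len(feats))
    decay = np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)
    return sim * decay

def propagate(feats, adj):
    """One step of row-normalised graph smoothing over the frame features."""
    return (adj / adj.sum(axis=1, keepdims=True)) @ feats

def detect_boundaries(feats, tau=5.0, radius=3, ratio=3.0):
    """Return indices of frames that start a new event (hypothetical helper)."""
    smoothed = propagate(feats, time_decay_graph(feats, tau))
    unit = smoothed / (np.linalg.norm(smoothed, axis=1, keepdims=True) + 1e-8)
    # Dissimilarity between each frame and its successor.
    jump = 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)
    # Simple noise filter: a small symmetric smoothing kernel (assumed choice).
    jump = np.convolve(jump, np.array([0.25, 0.5, 0.25]), mode="same")
    # Ratio test: keep local maxima that stand well above the mean level.
    baseline = jump.mean() + 1e-12
    peaks = [
        t for t in range(len(jump))
        if jump[t] / baseline > ratio
        and jump[t] == jump[max(0, t - radius): t + radius + 1].max()
    ]
    return [t + 1 for t in peaks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = np.eye(8)[0], np.eye(8)[1]  # two synthetic, orthogonal "events"
    feats = np.vstack([np.tile(a, (20, 1)), np.tile(b, (20, 1))])
    feats = feats + 0.01 * rng.normal(size=feats.shape)
    print(detect_boundaries(feats))
```

In a full pipeline, the returned indices would split the video into event segments that are passed to the MLLM one at a time, which is how the abstract motivates the reduced cost of long-video processing. Real frame features would come from a visual encoder rather than the synthetic vectors used here.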

URL

https://arxiv.org/abs/2504.13092

PDF

https://arxiv.org/pdf/2504.13092.pdf

