Paper Reading AI Learner

Harnessing Large Language Models for Training-free Video Anomaly Detection

2024-04-01 09:34:55
Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, Elisa Ricci

Abstract

Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to be domain-specific, thus being costly for practical deployment as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
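The caption-cleaning step the abstract mentions rests on cross-modal similarity: with a modality-aligned VLM, frame embeddings and caption embeddings live in the same space, so a noisy per-frame caption can be swapped for the candidate caption whose embedding best matches the frame. A minimal sketch of that idea with synthetic embeddings (the function names, random data, and plain cosine scoring are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity: returns a (len(a), len(b)) matrix.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def clean_captions(frame_embs, caption_embs):
    """For each frame embedding, pick the index of the candidate caption
    whose embedding is most similar (hypothetical cleaning step)."""
    sims = cosine_sim(frame_embs, caption_embs)  # (n_frames, n_captions)
    return sims.argmax(axis=1)

# Toy example: 3 frames, 4 candidate captions, 8-dim aligned embeddings.
rng = np.random.default_rng(0)
frame_embs = rng.normal(size=(3, 8))
caption_embs = rng.normal(size=(4, 8))
selected = clean_captions(frame_embs, caption_embs)
print(selected.shape)  # one chosen caption index per frame
```

The same similarity matrix could, in principle, also weight neighboring frames' LLM anomaly scores for the score-refinement step, but that is beyond this sketch.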


URL

https://arxiv.org/abs/2404.01014

PDF

https://arxiv.org/pdf/2404.01014.pdf
