Abstract
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to being domain-specific, making practical deployment costly, since any domain change entails new data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate a textual description for each frame of any test video. Given these textual scene descriptions, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
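The two score-refinement ideas in the abstract (cross-modal caption cleaning and temporal aggregation of LLM scores) can be illustrated with a minimal toy sketch. All names and the fixed embedding vectors below are hypothetical stand-ins; the actual method relies on real VLM/LLM components not reproduced here.

```python
# Toy sketch of two LAVAD-style refinement steps (hypothetical names;
# real VLM/LLM embeddings and scores are stubbed with plain lists).
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def clean_caption(frame_emb, candidates):
    """Caption cleaning: keep the candidate caption whose text embedding
    is most similar to the frame's image embedding (cross-modal similarity)."""
    return max(candidates, key=lambda c: cosine(frame_emb, c["emb"]))

def temporal_aggregate(scores, window=3):
    """Smooth per-frame anomaly scores over a centered temporal window,
    a simple stand-in for the LLM-based temporal aggregation."""
    out = []
    for i in range(len(scores)):
        lo = max(0, i - window // 2)
        hi = min(len(scores), i + window // 2 + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

if __name__ == "__main__":
    # Pick the caption that best matches a (stub) frame embedding.
    best = clean_caption(
        [1.0, 0.0],
        [{"text": "a fight breaks out", "emb": [0.9, 0.1]},
         {"text": "people walking", "emb": [0.0, 1.0]}],
    )
    print(best["text"])                      # caption closest to the frame
    print(temporal_aggregate([0, 0, 1, 0, 0]))  # isolated spike is smoothed
```

The smoothing step reflects the intuition that anomalies persist across neighboring frames, so an isolated noisy score is damped while a sustained high score survives aggregation.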
URL
https://arxiv.org/abs/2404.01014