Abstract
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to being domain-specific, making practical deployment costly, since any domain change entails new data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate a textual description for each frame of any test video. Given these textual scene descriptions, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
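The two score-refinement ideas in the abstract (cross-modal caption cleaning and temporal aggregation of LLM scores) can be illustrated with a minimal toy sketch. All names and the fixed embedding vectors below are hypothetical stand-ins; the actual method relies on real VLM/LLM components not reproduced here.

```python
# Toy sketch of two LAVAD-style refinement steps (hypothetical names;
# real VLM/LLM embeddings and scores are stubbed with plain lists).
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def clean_caption(frame_emb, candidates):
    """Caption cleaning: keep the candidate caption whose text embedding
    is most similar to the frame's image embedding (cross-modal similarity)."""
    return max(candidates, key=lambda c: cosine(frame_emb, c["emb"]))

def temporal_aggregate(scores, window=3):
    """Smooth per-frame anomaly scores over a centered temporal window,
    a simple stand-in for the LLM-based temporal aggregation."""
    out = []
    for i in range(len(scores)):
        lo = max(0, i - window // 2)
        hi = min(len(scores), i + window // 2 + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

if __name__ == "__main__":
    # Pick the caption that best matches a (stub) frame embedding.
    best = clean_caption(
        [1.0, 0.0],
        [{"text": "a fight breaks out", "emb": [0.9, 0.1]},
         {"text": "people walking", "emb": [0.0, 1.0]}],
    )
    print(best["text"])                      # caption closest to the frame
    print(temporal_aggregate([0, 0, 1, 0, 0]))  # isolated spike is smoothed
```

The smoothing step reflects the intuition that anomalies persist across neighboring frames, so an isolated noisy score is damped while a sustained high score survives aggregation.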
URL
https://arxiv.org/abs/2404.01014