Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

2024-04-12 15:18:25
Zhiwei Yang, Jing Liu, Peng Wu

Abstract

Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels from weak labels and then self-training a classifier is currently a promising solution. However, existing methods use only the RGB visual modality and neglect category text information, which limits the accuracy of the generated pseudo-labels and degrades self-training performance. Inspired by the manual labeling process based on event descriptions, we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model to align video event description text with the corresponding video frames and thereby generate pseudo-labels. Specifically, we first fine-tune CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism, assisted by a normality visual prompt, to improve the matching accuracy between video event description text and video frames. Then, we design a pseudo-label generation module based on normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Viole
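
The general idea of turning text-frame similarity into frame-level pseudo-labels can be illustrated with a small sketch. This is not the TPWNG implementation: the abstract gives no formulas, so the two-way softmax over an anomaly description and a "normal" description, the temperature, the threshold, and the function names `cosine` and `frame_pseudo_labels` are all assumptions made for illustration. The toy 2-D vectors stand in for embeddings that a CLIP-style image and text encoder would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def frame_pseudo_labels(frame_embs, anomaly_text_emb, normal_text_emb,
                        temperature=0.1, threshold=0.5):
    """Assign each frame a (probability, label) pair.

    Hypothetical scheme: compare each frame with an anomaly text
    embedding and a normality text embedding, apply a two-way softmax,
    and mark the frame anomalous (1) if the anomaly probability
    exceeds the threshold.
    """
    results = []
    for frame in frame_embs:
        s_abn = cosine(frame, anomaly_text_emb) / temperature
        s_nor = cosine(frame, normal_text_emb) / temperature
        # Subtract the max before exponentiating for numerical stability.
        m = max(s_abn, s_nor)
        e_abn = math.exp(s_abn - m)
        e_nor = math.exp(s_nor - m)
        p_abn = e_abn / (e_abn + e_nor)
        results.append((p_abn, 1 if p_abn > threshold else 0))
    return results

# Toy embeddings (assumed values, not real CLIP features).
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
anomaly_text = [1.0, 0.0]   # e.g. an event description of the anomaly class
normal_text = [0.0, 1.0]    # e.g. a description of normal activity

labels = [label for _, label in
          frame_pseudo_labels(frames, anomaly_text, normal_text)]
print(labels)
```

In this sketch the normality description serves only as a competing class in the softmax, so a frame is labeled anomalous only when it matches the anomaly text better than the normal text. The paper's normality guidance, learnable prompts, and fine-tuning losses are more involved and are not reproduced here.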

URL

https://arxiv.org/abs/2404.08531

PDF

https://arxiv.org/pdf/2404.08531.pdf

