Paper Reading AI Learner

Do LLMs Understand Visual Anomalies? Uncovering LLM Capabilities in Zero-shot Anomaly Detection

2024-04-15 10:42:22
Jiaqi Zhu, Shaofeng Cai, Fang Deng, Junran Wu

Abstract

Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and prioritize global image-level representations over crucial local pixel-level image-to-text alignment that is necessary for accurate anomaly localization. In this paper, we present ALFA, a training-free approach designed to address these challenges via a unified model. We propose a run-time prompt adaptation strategy, which first generates informative anomaly prompts to leverage the capabilities of a large language model (LLM). This strategy is enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We further introduce a novel fine-grained aligner to fuse local pixel-level semantics for precise anomaly localization, by projecting the image-text alignment from global to local semantic spaces. Extensive evaluations on the challenging MVTec and VisA datasets confirm ALFA's effectiveness in harnessing the language potential for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec AD and 8.9% on VisA compared to state-of-the-art zero-shot VAD approaches.
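The core mechanism the abstract describes, scoring an image against text embeddings of "normal" and "anomalous" prompts, can be sketched as follows. This is a minimal illustration of CLIP-style zero-shot anomaly scoring with prompt ensembling, not the ALFA method itself: the embeddings, prompt lists, and the `temperature` value are all toy placeholders standing in for real LVLM features and tuned hyperparameters.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anomaly_score(embed, normal_prompts, abnormal_prompts, temperature=0.07):
    # Average similarity to each prompt ensemble ("normal" vs. "anomalous").
    s_norm = sum(cosine(embed, p) for p in normal_prompts) / len(normal_prompts)
    s_abn = sum(cosine(embed, p) for p in abnormal_prompts) / len(abnormal_prompts)
    # Temperature-scaled softmax over the two ensembles -> P(anomalous).
    e_n = math.exp(s_norm / temperature)
    e_a = math.exp(s_abn / temperature)
    return e_a / (e_n + e_a)

# Toy 3-d embeddings; real models use high-dimensional (e.g. 512-d) features.
normal_prompts = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
abnormal_prompts = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

good_image = [0.95, 0.05, 0.0]   # close to the "normal" prompt ensemble
bad_image = [0.1, 0.9, 0.05]     # close to the "anomalous" prompt ensemble

print(anomaly_score(good_image, normal_prompts, abnormal_prompts) < 0.5)  # True
print(anomaly_score(bad_image, normal_prompts, abnormal_prompts) > 0.5)   # True
```

For the localization side discussed in the abstract, the same scoring would be applied per patch embedding rather than per image, yielding a pixel-level anomaly map; ALFA's contribution is adapting the prompts at run time and aligning text to these local features rather than only to the global image embedding.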

Abstract (translated)

Large vision-language models (LVLMs) are markedly proficient at deriving visual representations guided by natural language. Recent work has used LVLMs to tackle zero-shot visual anomaly detection (VAD) by pairing images with textual descriptions of normal and abnormal conditions, referred to as anomaly prompts. However, existing methods rely on static anomaly prompts that are prone to cross-semantic ambiguity, and prioritize global image-level representations over the local pixel-level image-to-text alignment required for accurate anomaly localization. In this paper, we present ALFA, a training-free approach that addresses these challenges with a unified model. We propose a run-time prompt adaptation strategy that first generates informative anomaly prompts by leveraging a large language model (LLM); this strategy is enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We further introduce a novel fine-grained aligner that fuses local pixel-level semantics for precise anomaly localization by projecting the image-text alignment from the global to the local semantic space. Extensive evaluations on the challenging MVTec and VisA datasets confirm ALFA's effectiveness in harnessing the potential of language for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec AD and 8.9% on VisA over state-of-the-art zero-shot VAD approaches.

URL

https://arxiv.org/abs/2404.09654

PDF

https://arxiv.org/pdf/2404.09654.pdf

