Paper Reading AI Learner

Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection

2024-05-08 03:13:20
Zhaoxiang Zhang, Hanqiu Deng, Jinan Bao, Xingyu Li

Abstract

Image anomaly detection has long been a challenging task in the field of computer vision. The advent of vision-language models, particularly the rise of CLIP-based frameworks, has opened new avenues for zero-shot anomaly detection. Recent studies have explored the use of CLIP by aligning images with normal and anomalous prompt descriptions. However, exclusive dependence on textual guidance often falls short, highlighting the critical importance of additional visual references. In this work, we introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system. Our method processes pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context. This dual-image strategy markedly enhances both anomaly classification and localization performance. Furthermore, we strengthen our model with a test-time adaptation module that incorporates synthesized anomalies to refine localization capabilities. Our approach significantly exploits the potential of vision-language joint anomaly detection and demonstrates performance comparable to current SOTA methods across various datasets.
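The abstract's core idea, scoring each image in a pair against the other as a visual reference, can be sketched as a cross-image nearest-neighbor comparison over patch embeddings. The sketch below is illustrative only, not the authors' implementation: the function name, feature shapes, and the use of random vectors in place of real CLIP patch features are all assumptions.

```python
import numpy as np

def cross_reference_scores(feats_a, feats_b):
    """Per-patch anomaly scores for image A, using image B as the visual reference.

    feats_a, feats_b: (num_patches, dim) L2-normalized patch embeddings,
    e.g. from a CLIP vision encoder (shapes here are illustrative).
    A patch with no close counterpart in the reference image gets a high score.
    """
    # cosine similarity between every patch pair -> (num_a, num_b)
    sim = feats_a @ feats_b.T
    # distance to the most similar reference patch; in [0, 2] for unit vectors
    return 1.0 - sim.max(axis=1)

def _unit(x):
    # normalize rows to unit length so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# stand-in patch features (196 patches of dim 512, as in a 14x14 ViT grid)
rng = np.random.default_rng(0)
feats_a = _unit(rng.normal(size=(196, 512)))
feats_b = _unit(rng.normal(size=(196, 512)))

# dual-image setup: each image is scored against the other
score_a = cross_reference_scores(feats_a, feats_b)
score_b = cross_reference_scores(feats_b, feats_a)
```

In the paper's framing, these vision-side scores would be combined with the text-side CLIP scores from normal/anomalous prompts; the fusion rule is not specified in the abstract, so it is omitted here.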


URL

https://arxiv.org/abs/2405.04782

PDF

https://arxiv.org/pdf/2405.04782.pdf

