Abstract
Image anomaly detection has long been a challenging task in computer vision. The advent of vision-language models, particularly CLIP-based frameworks, has opened new avenues for zero-shot anomaly detection. Recent studies have explored CLIP by aligning images with normal and anomalous prompt descriptions. However, exclusive dependence on textual guidance often falls short, highlighting the critical importance of additional visual references. In this work, we introduce a Dual-Image Enhanced CLIP approach that leverages a joint vision-language scoring system. Our method processes pairs of images, using each as a visual reference for the other, thereby enriching the inference process with visual context. This dual-image strategy markedly enhances both anomaly classification and localization performance. Furthermore, we strengthen our model with a test-time adaptation module that incorporates synthesized anomalies to refine localization. Our approach significantly exploits the potential of joint vision-language anomaly detection and demonstrates performance comparable to current SOTA methods across various datasets.
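The joint vision-language scoring described above can be sketched minimally as follows. This is an illustrative assumption-laden sketch, not the paper's implementation: it assumes pre-extracted, L2-normalized CLIP patch embeddings for a query image and its paired reference image, plus embeddings for a "normal" and an "anomalous" text prompt; the fusion rule (a simple average) and the temperature value are placeholders, as the paper's exact formulation is not given in the abstract.

```python
import torch
import torch.nn.functional as F

def dual_image_anomaly_scores(patch_feats, ref_patch_feats,
                              normal_text, anomal_text):
    """Hypothetical per-patch anomaly score combining text guidance with a
    paired reference image (dual-image idea from the abstract).

    patch_feats:     (N, D) CLIP patch embeddings of the query image
    ref_patch_feats: (M, D) patch embeddings of the reference image
    normal_text:     (D,) embedding of the "normal" prompt
    anomal_text:     (D,) embedding of the "anomalous" prompt
    All features are assumed L2-normalized.
    """
    # Language score: softmax over similarities to the two prompts;
    # take the probability assigned to the "anomalous" prompt.
    sims = patch_feats @ torch.stack([normal_text, anomal_text]).T  # (N, 2)
    text_score = torch.softmax(sims / 0.07, dim=-1)[:, 1]

    # Vision score: a patch with no close match among the reference
    # image's patches is more likely anomalous.
    vis_score = 1.0 - (patch_feats @ ref_patch_feats.T).max(dim=-1).values

    # Joint score: simple average (placeholder fusion rule).
    return 0.5 * (text_score + vis_score)

# Toy usage with random normalized features.
D = 8
q = F.normalize(torch.randn(16, D), dim=-1)
r = F.normalize(torch.randn(16, D), dim=-1)
tn = F.normalize(torch.randn(D), dim=-1)
ta = F.normalize(torch.randn(D), dim=-1)
scores = dual_image_anomaly_scores(q, r, tn, ta)
print(scores.shape)  # torch.Size([16])
```

In this sketch the two images score each other symmetrically by calling the function twice with the arguments swapped; thresholding or taking the maximum of the per-patch scores would yield image-level classification.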
URL
https://arxiv.org/abs/2405.04782