Abstract
Large vision-language models (LVLMs) are markedly proficient at deriving visual representations guided by natural language. Recent work has applied LVLMs to zero-shot visual anomaly detection (VAD) by pairing images with textual descriptions of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and they prioritize global image-level representations over the local pixel-level image-to-text alignment that accurate anomaly localization requires. In this paper, we present ALFA, a training-free approach designed to address these challenges with a unified model. We propose a run-time prompt adaptation strategy that first leverages a large language model (LLM) to generate informative anomaly prompts, and is further enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We also introduce a novel fine-grained aligner that fuses local pixel-level semantics for precise anomaly localization by projecting the image-text alignment from the global to the local semantic space. Extensive evaluations on the challenging MVTec AD and VisA datasets confirm ALFA's effectiveness in harnessing the language potential for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec AD and 8.9% on VisA over state-of-the-art zero-shot VAD approaches.
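The local image-to-text alignment idea can be illustrated with a minimal sketch: score each patch embedding against normal versus abnormal prompt embeddings and turn the two similarities into a per-patch anomaly probability. This assumes CLIP-like L2-normalizable embeddings; the function name, the simple prompt averaging, and the temperature value are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def anomaly_map(patch_feats, normal_txt, abnormal_txt, temperature=0.07):
    """Per-patch anomaly scoring via image-to-text similarity.

    patch_feats  : (N, D) array of patch embeddings (N = H*W patches)
    normal_txt   : (K, D) embeddings of normal-state anomaly prompts
    abnormal_txt : (K, D) embeddings of abnormal-state anomaly prompts
    Returns an (N,) array of abnormal probabilities in [0, 1].
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    p = normalize(patch_feats)
    # Collapse each prompt set into a single state embedding (simple mean).
    n = normalize(normal_txt).mean(axis=0)
    a = normalize(abnormal_txt).mean(axis=0)
    # Temperature-scaled cosine similarities to each state.
    sim_n = p @ n / temperature
    sim_a = p @ a / temperature
    # Two-way softmax: probability that each patch matches the abnormal state.
    e_n, e_a = np.exp(sim_n), np.exp(sim_a)
    return e_a / (e_n + e_a)
```

Reshaping the returned vector back to (H, W) yields a localization heatmap; in practice the prompt sets would come from LLM-generated descriptions rather than a fixed template.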
URL
https://arxiv.org/abs/2404.09654