Abstract
Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
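The abstract describes a two-part reward: a perception term covering classification accuracy, spatial alignment, and anomaly-type correctness, plus a behavior term encouraging efficient tool use. The sketch below shows one plausible way such a reward could be composed; the weights, field names, and the tool-call budget are illustrative assumptions, not the paper's actual formulation.

```python
def perception_reward(pred, gt, iou):
    # Hypothetical weights -- the paper's exact formulation is not given in the abstract.
    r = 0.0
    if pred["label"] == gt["label"]:   # classification accuracy (normal vs. anomalous)
        r += 1.0
    r += 0.5 * iou                     # spatial alignment, e.g. predicted-box IoU in [0, 1]
    if pred["type"] == gt["type"]:     # anomaly-type correctness (e.g. "scratch", "dent")
        r += 0.5
    return r

def behavior_reward(num_tool_calls, budget=3):
    # Encourage efficient tool use: free within an assumed budget,
    # linear penalty for each extra Zoomer/Retriever invocation beyond it.
    return -0.1 * max(0, num_tool_calls - budget)

def total_reward(pred, gt, iou, num_tool_calls):
    # Combined objective optimized during the reinforcement-learning stage.
    return perception_reward(pred, gt, iou) + behavior_reward(num_tool_calls)
```

A fully correct prediction with perfect localization and in-budget tool use would score 2.0 under these assumed weights, while redundant tool calls steadily erode the total.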
URL
https://arxiv.org/abs/2512.13671