Paper Reading AI Learner

AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection

2025-12-15 18:57:04
Junwen Miao, Penghui Du, Yi Liu, Yu Wang, Yan Wang

Abstract

Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
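The abstract describes a two-part reward: a perception reward covering classification accuracy, spatial alignment, and defect-type correctness, plus a behavior reward that discourages excessive tool calls. The paper does not give the exact formula, so the following is only a minimal sketch of how such a reward might be combined; all names, weights, and the tool-call budget are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    label: str           # predicted "normal" or "anomalous"
    iou: float           # overlap of predicted defect box with ground truth (0..1)
    type_correct: bool   # whether the predicted defect type matches ground truth
    tool_calls: int      # number of Perceptive Zoomer / Comparative Retriever calls

def perception_reward(res: EpisodeResult, gt_label: str) -> float:
    """Reward classification accuracy; for true anomalies, also reward
    spatial alignment (IoU) and defect-type correctness."""
    r = 1.0 if res.label == gt_label else 0.0
    if gt_label == "anomalous" and res.label == "anomalous":
        r += res.iou                          # spatial-alignment term
        r += 1.0 if res.type_correct else 0.0 # type-correctness term
    return r

def behavior_reward(res: EpisodeResult, budget: int = 3) -> float:
    """Penalize tool use beyond a fixed budget to encourage efficiency."""
    return -0.1 * max(0, res.tool_calls - budget)

def total_reward(res: EpisodeResult, gt_label: str) -> float:
    return perception_reward(res, gt_label) + behavior_reward(res)
```

Under this sketch, a correct anomaly call with tight localization and the right defect type scores highest, while spamming the zoom/retrieval tools past the budget steadily erodes the total.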


URL

https://arxiv.org/abs/2512.13671

PDF

https://arxiv.org/pdf/2512.13671.pdf

