Abstract
Unsupervised anomaly detection enables the identification of potential pathological areas by juxtaposing original images with their pseudo-healthy reconstructions generated by models trained exclusively on normal images. However, the clinical interpretation of the resulting anomaly maps is challenging due to a lack of detailed, understandable explanations. Recent advancements in language models have demonstrated the ability to mimic human-like understanding and provide detailed descriptions. This raises an interesting question: \textit{How can language models be employed to make the anomaly maps more explainable?} To the best of our knowledge, we are the first to leverage a language model for unsupervised anomaly detection, for which we construct a dataset with different questions and answers. Additionally, we present a novel multi-image visual question answering framework tailored for anomaly detection, incorporating diverse feature fusion strategies to enhance visual knowledge extraction. Our experiments reveal that the framework, augmented by our new Knowledge Q-Former module, adeptly answers questions on the anomaly detection dataset. Moreover, integrating anomaly maps as inputs distinctly improves the detection of unseen pathologies.
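The paper does not specify how its anomaly maps are computed, but a common minimal formulation of the comparison it describes is a pixel-wise residual between the original image and its pseudo-healthy reconstruction. The sketch below illustrates that idea under this assumption; the function name `anomaly_map` and the toy arrays are illustrative, not from the paper.

```python
import numpy as np

def anomaly_map(original: np.ndarray, reconstruction: np.ndarray) -> np.ndarray:
    """Pixel-wise absolute difference between an image and its
    pseudo-healthy reconstruction. Regions the normal-only model
    cannot reproduce receive high values, flagging candidate anomalies.
    (Illustrative residual formulation, not the paper's exact method.)"""
    return np.abs(original.astype(np.float32) - reconstruction.astype(np.float32))

# Toy 2x2 example: one pixel deviates strongly from its reconstruction.
img = np.array([[0.1, 0.9], [0.2, 0.3]], dtype=np.float32)
recon = np.array([[0.1, 0.2], [0.2, 0.3]], dtype=np.float32)
amap = anomaly_map(img, recon)
```

Such a map highlights *where* the model sees deviation, but not *why*, which is exactly the explainability gap the paper's language-model framework targets.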
URL
https://arxiv.org/abs/2404.07622