Paper Reading AI Learner

Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model

2026-01-20 15:06:10
Haoran Xu, Yanlin Liu, Zizhao Tong, Jiaze Li, Kexue Fu, Yuyang Zhang, Longxiang Gao, Shuaiguang Li, Xingyu Li, Yanran Xu, Changwei Wang

Abstract

Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges of detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of multimodal large language models (MLLMs) and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance on both near-OOD and far-OOD tasks. Specifically, (1) for near-OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far-OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.
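The training-free, zero-shot CLIP-based detection the abstract builds on is commonly scored as the maximum softmax over an image embedding's cosine similarities to the ID class prompts (the MCM-style score). A minimal sketch follows; the toy embeddings and the `mcm_score` helper are illustrative stand-ins, not the paper's MM-OOD implementation.

```python
import numpy as np

def mcm_score(image_feat, text_feats, temperature=1.0):
    """Maximum softmax over cosine similarities between an image embedding
    and the text embeddings of the ID class prompts (MCM-style score).
    Higher score -> the sample looks more in-distribution."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (txt @ img) / temperature          # one cosine per ID class
    probs = np.exp(logits - logits.max())       # numerically stable softmax
    probs /= probs.sum()
    return float(probs.max())

# Toy embeddings: three orthonormal "class prompt" vectors stand in for
# CLIP text features; real features would come from a CLIP encoder.
text_feats = np.eye(4)[:3]
id_image = np.array([1.0, 0.0, 0.0, 0.0])    # aligns with class 0
ood_image = np.array([1.0, 1.0, 1.0, 1.0])   # equally far from every class

assert mcm_score(id_image, text_feats) > mcm_score(ood_image, text_feats)
```

In practice the embeddings come from a CLIP image/text encoder, and samples whose score falls below a chosen threshold are flagged as OOD.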


URL

https://arxiv.org/abs/2601.14052

PDF

https://arxiv.org/pdf/2601.14052.pdf

