
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

2024-03-29 16:26:20
Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li

Abstract

The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, which constrains both their flexibility in use and the depth of their responses. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained MLLM that connects a vision encoder, a visual prompt encoder, and an LLM to understand both language and various visual prompts (points, bounding boxes, and free-form shapes). To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data is a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, covering natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. MDVP-Bench is a comprehensive and challenging benchmark for assessing a model's ability to understand visual prompting instructions. Our experiments demonstrate SPHINX-V's multimodal interaction capabilities through visual prompting, showing significant improvements in detailed pixel-level description and question answering.
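As a rough illustration of what an image-visual prompt-text sample could look like, the sketch below models the three prompt types named in the abstract (points, bounding boxes, free-form shapes) as plain Python dataclasses. Every class, field, and function name here is hypothetical and not taken from the paper or its code release. The helper `to_normalized_box` is likewise only an assumed preprocessing step for illustration; it is not a description of how SPHINX-V's visual prompt encoder works.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class PointPrompt:
    xy: Point


@dataclass
class BoxPrompt:
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class FreeFormPrompt:
    vertices: List[Point]  # outline of a hand-drawn region


VisualPrompt = Union[PointPrompt, BoxPrompt, FreeFormPrompt]


@dataclass
class PromptedSample:
    """Hypothetical layout of one image-visual prompt-text sample."""
    image_path: str
    prompts: List[VisualPrompt]
    instruction: str
    response: str


def to_normalized_box(prompt: VisualPrompt, width: int, height: int):
    """Reduce any prompt to an (x0, y0, x1, y1) box scaled to [0, 1].

    Assumption for illustration only: a point becomes a zero-area box,
    and a free-form shape is replaced by its axis-aligned bounding box.
    """
    if isinstance(prompt, PointPrompt):
        x0 = x1 = prompt.xy[0]
        y0 = y1 = prompt.xy[1]
    elif isinstance(prompt, BoxPrompt):
        x0, y0, x1, y1 = prompt.x_min, prompt.y_min, prompt.x_max, prompt.y_max
    else:
        xs = [v[0] for v in prompt.vertices]
        ys = [v[1] for v in prompt.vertices]
        x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    return (x0 / width, y0 / height, x1 / width, y1 / height)


sample = PromptedSample(
    image_path="example.jpg",  # placeholder path
    prompts=[BoxPrompt(0, 0, 100, 200), PointPrompt((50, 100))],
    instruction="Describe the region marked by the first prompt.",
    response="(model output would go here)",
)
boxes = [to_normalized_box(p, width=200, height=400) for p in sample.prompts]
```

Reducing every prompt to a normalized box discards the shape detail of free-form prompts, so it serves only as a simple common representation for this sketch.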

URL

https://arxiv.org/abs/2403.20271

PDF

https://arxiv.org/pdf/2403.20271.pdf

