
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

2024-04-19 17:22:51
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi

Abstract

We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded, fine-grained visual perception. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. These capabilities are built upon a localized visual tokenization mechanism, in which an image input is decomposed into regions of interest that are then encoded into region tokens. By integrating region tokens into user instructions and model responses, Groma can both understand user-specified region inputs and ground its textual output to images. In addition, to enhance Groma's grounded chat ability, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V together with visual prompting techniques. Compared with MLLMs that rely on the language model or an external module for localization, Groma consistently delivers superior performance on standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: this https URL.
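
Since the abstract's core idea is mechanistic (decompose the image into regions of interest, encode each region as a token, and interleave those tokens with text), a toy sketch may help make it concrete. Everything below — the module name `RegionTokenizer`, the mean-pooling of cropped features, and the dimensions — is an illustrative assumption, not Groma's actual implementation.

```python
# Illustrative-only sketch of "localized visual tokenization": crop region
# proposals out of a backbone feature map, pool each crop, and project it
# into the LLM's embedding space so it can sit in the token sequence.
# All names, shapes, and the pooling choice are assumptions, not Groma's code.
import torch
import torch.nn as nn


class RegionTokenizer(nn.Module):  # hypothetical module name
    def __init__(self, feat_dim: int = 256, llm_dim: int = 1024):
        super().__init__()
        # Map pooled region features into the language model's embedding space.
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feature_map: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feature_map: (C, H, W) features from a vision backbone.
        # boxes: (N, 4) regions of interest as (x1, y1, x2, y2) in feature coords.
        tokens = []
        for x1, y1, x2, y2 in boxes.round().long().tolist():
            crop = feature_map[:, y1:y2 + 1, x1:x2 + 1]  # cut out one region
            pooled = crop.mean(dim=(1, 2))               # (C,) average-pool it
            tokens.append(self.proj(pooled))             # one "region token"
        return torch.stack(tokens)                       # (N, llm_dim)


if __name__ == "__main__":
    feats = torch.randn(256, 32, 32)                    # fake backbone output
    rois = torch.tensor([[2.0, 3.0, 10.0, 12.0],        # two fake proposals
                         [5.0, 5.0, 20.0, 25.0]])
    region_tokens = RegionTokenizer()(feats, rois)
    print(region_tokens.shape)                          # torch.Size([2, 1024])
```

In the sequence fed to the LLM, such region tokens would be interleaved with ordinary text tokens so that user-specified regions and grounded responses can refer to them directly; how the regions of interest are proposed in the first place is left out of this sketch.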

URL

https://arxiv.org/abs/2404.13013

PDF

https://arxiv.org/pdf/2404.13013.pdf

