Abstract
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. In addition, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or an external module for localization, Groma consistently demonstrates superior performance on standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: this https URL.
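The abstract describes interleaving region tokens with text so that region inputs can be referenced in instructions and textual outputs can be grounded back to image coordinates. The following is a minimal sketch of that idea only; the class, method names, and `<rK>` placeholder scheme are illustrative assumptions, not Groma's actual implementation, and a real model would also attach a visual embedding to each region token.

```python
# Hypothetical sketch of localized visual tokenization: regions of interest
# become symbolic region tokens that can appear in prompts and responses,
# and tokens in model output are mapped back to boxes (grounding).
# Names here are assumptions for illustration, not Groma's API.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

@dataclass
class Region:
    box: Box  # a region of interest in image coordinates

class RegionTokenizer:
    """Turns regions of interest into referenceable region tokens."""

    def __init__(self) -> None:
        self.regions: List[Region] = []

    def tokenize(self, regions: List[Region]) -> List[str]:
        # Each region gets a placeholder token <rK>; in a real MLLM each
        # token would also carry an encoded visual feature of the region.
        self.regions = list(regions)
        return [f"<r{i}>" for i in range(len(regions))]

    def ground(self, text: str) -> List[Tuple[str, Box]]:
        # Map region tokens mentioned in the text back to their boxes,
        # grounding the textual output in the image.
        return [
            (f"<r{i}>", r.box)
            for i, r in enumerate(self.regions)
            if f"<r{i}>" in text
        ]

regions = [Region((10, 20, 110, 220)), Region((150, 40, 300, 200))]
tok = RegionTokenizer()
tokens = tok.tokenize(regions)
prompt = f"Describe the object in {tokens[1]}."            # region input
response = f"{tokens[1]} is a dog standing on the grass."  # grounded output
print(tok.ground(response))  # [('<r1>', (150, 40, 300, 200))]
```

The point of the sketch is the data flow: regions enter the token stream alongside text, so localization is part of tokenization rather than delegated to the language model or an external detector.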
URL
https://arxiv.org/abs/2404.13013