Abstract
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: By integrating an additional DINOv2 encoder, the model learns better, more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: Besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
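The multi-granularity idea in (2) can be illustrated with a minimal sketch: a global encoder sees a downsampled view of the whole image, a local encoder sees high-resolution sub-image crops (the "any resolution" scheme in (1)), and the two feature streams are fused. This is an assumption-laden toy in NumPy, not Ferret-v2's actual implementation; the encoder stand-ins, the grid size, and the additive fusion are all illustrative.

```python
import numpy as np

# Toy sketch of multi-granularity visual encoding. The "encoders" here are
# random linear projections standing in for CLIP-style (global) and
# DINOv2-style (local) backbones; real models would be much larger.
rng = np.random.default_rng(0)

D = 64  # shared feature dimension (assumed)
W_global = rng.standard_normal((3, D))  # mock global-encoder weights
W_local = rng.standard_normal((3, D))   # mock local-encoder weights

def encode_global(image):
    """Stand-in for a CLIP-style encoder applied to the full image."""
    return image.mean(axis=(0, 1)) @ W_global  # one pooled vector, shape (D,)

def encode_local(crop):
    """Stand-in for a DINOv2-style encoder applied to one high-res crop."""
    return crop.mean(axis=(0, 1)) @ W_local

def multi_granularity_features(image, grid=2):
    """Encode grid x grid crops locally and fuse in the global context."""
    H, W, _ = image.shape
    g = encode_global(image)  # global context vector
    crops = [image[i * H // grid:(i + 1) * H // grid,
                   j * W // grid:(j + 1) * W // grid]
             for i in range(grid) for j in range(grid)]
    local_tokens = np.stack([encode_local(c) for c in crops])  # (grid*grid, D)
    # broadcast-add the global context onto every local token
    return local_tokens + g

image = rng.random((8, 8, 3))
tokens = multi_granularity_features(image)
print(tokens.shape)  # (4, 64)
```

The fusion step here is a plain addition; the paper's actual design for combining encoders and aligning their features at high resolution is more involved, but the shape of the computation (global context broadcast onto per-crop local tokens) follows the same pattern.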
URL
https://arxiv.org/abs/2404.07973