Abstract
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: By integrating an additional DINOv2 encoder, the model learns better, more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: Besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
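The multi-granularity idea in (2) can be illustrated with a minimal sketch: a global encoder sees a downsampled view of the whole image, a local encoder sees high-resolution sub-image crops (the "any resolution" scheme in (1)), and the two feature streams are fused. This is an assumption-laden toy in NumPy, not Ferret-v2's actual implementation; the encoder stand-ins, the grid size, and the additive fusion are all illustrative.

```python
import numpy as np

# Toy sketch of multi-granularity visual encoding. The "encoders" here are
# random linear projections standing in for CLIP-style (global) and
# DINOv2-style (local) backbones; real models would be much larger.
rng = np.random.default_rng(0)

D = 64  # shared feature dimension (assumed)
W_global = rng.standard_normal((3, D))  # mock global-encoder weights
W_local = rng.standard_normal((3, D))   # mock local-encoder weights

def encode_global(image):
    """Stand-in for a CLIP-style encoder applied to the full image."""
    return image.mean(axis=(0, 1)) @ W_global  # one pooled vector, shape (D,)

def encode_local(crop):
    """Stand-in for a DINOv2-style encoder applied to one high-res crop."""
    return crop.mean(axis=(0, 1)) @ W_local

def multi_granularity_features(image, grid=2):
    """Encode grid x grid crops locally and fuse in the global context."""
    H, W, _ = image.shape
    g = encode_global(image)  # global context vector
    crops = [image[i * H // grid:(i + 1) * H // grid,
                   j * W // grid:(j + 1) * W // grid]
             for i in range(grid) for j in range(grid)]
    local_tokens = np.stack([encode_local(c) for c in crops])  # (grid*grid, D)
    # broadcast-add the global context onto every local token
    return local_tokens + g

image = rng.random((8, 8, 3))
tokens = multi_granularity_features(image)
print(tokens.shape)  # (4, 64)
```

The fusion step here is a plain addition; the paper's actual design for combining encoders and aligning their features at high resolution is more involved, but the shape of the computation (global context broadcast onto per-crop local tokens) follows the same pattern.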
URL
https://arxiv.org/abs/2404.07973