Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark comprising 3.49 million questions and 3.32 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on eight representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at this https URL.
URL
https://arxiv.org/abs/2504.14988