Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

2025-04-21 09:30:41
Hong-Tao Yu, Xiu-Shen Wei, Yuxin Peng, Serge Belongie

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies assessing LVLMs both holistically and on specialized tasks have emerged, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark comprising 3.49 million questions and 3.32 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on eight representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at this https URL.
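To make the human-oriented evaluation concrete, the sketch below shows how a fine-grained question benchmark of this kind might be scored: each record pairs an image with a question, and a model-agnostic callable produces the answer. The JSON schema, the `answer_fn` interface, and the exact-match scoring rule are illustrative assumptions, not FG-BMK's actual protocol, which is defined in the authors' repository.

```python
# A minimal sketch of a fine-grained VQA evaluation loop. The record
# schema ("image", "question", "answer") and the answer_fn interface
# are hypothetical; FG-BMK's real harness may differ.
import json
from typing import Callable


def evaluate(questions_path: str,
             answer_fn: Callable[[str, str], str]) -> float:
    """Return exact-match accuracy over a list of question records."""
    with open(questions_path, encoding="utf-8") as f:
        # Assumed format: [{"image": ..., "question": ..., "answer": ...}, ...]
        records = json.load(f)

    correct = 0
    for rec in records:
        # answer_fn wraps the LVLM under test: it takes an image path
        # and a question string and returns the model's textual answer.
        prediction = answer_fn(rec["image"], rec["question"])
        correct += prediction.strip().lower() == rec["answer"].strip().lower()

    return correct / len(records) if records else 0.0
```

In practice, fine-grained answers (e.g., near-identical bird species names) typically require answer normalization or multiple-choice letter matching rather than raw string comparison, so exact match here is only a stand-in for the benchmark's actual scoring.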

URL

https://arxiv.org/abs/2504.14988

PDF

https://arxiv.org/pdf/2504.14988.pdf

