
Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

2025-01-04 04:59:33
Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, Guangyao Shi

Abstract

Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and outperform classical single-modality vision models on zero-shot classification. Despite their rapid advancement in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information for the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; a summary and categorization of popular VLM benchmarks and evaluation metrics; the applications of VLMs, including embodied agents, robotics, and video generation; and the challenges and issues faced by current VLMs, such as hallucination, fairness, and safety. Detailed collections, including papers and model repository links, are listed at this https URL.
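To make the zero-shot classification claim above concrete, here is a minimal sketch of how CLIP-style zero-shot classification works: candidate class names are phrased as text prompts, and the image is assigned to the prompt with the highest image-text similarity. This is an illustrative example, not code from the paper; it assumes the Hugging Face `transformers` library and the public `openai/clip-vit-base-patch32` checkpoint, and the labels and example image URL are arbitrary placeholders.

```python
# Minimal sketch of CLIP-style zero-shot classification (assumed setup:
# Hugging Face transformers + the openai/clip-vit-base-patch32 checkpoint).
from PIL import Image
import requests
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes are expressed as natural-language prompts; no
# task-specific fine-tuning is needed, which is what "zero-shot" means here.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Illustrative example image (a commonly used COCO validation image).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into per-label probabilities for this image.
probs = outputs.logits_per_image.softmax(dim=1)
print({label: round(p.item(), 3) for label, p in zip(labels, probs[0])})
```

Note that the text prompts effectively serve as the classifier head: adding a new class is just adding a string, which is why such models can outperform fixed-label single-modality classifiers in the zero-shot setting.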

URL

https://arxiv.org/abs/2501.02189

PDF

https://arxiv.org/pdf/2501.02189.pdf

