Abstract
Most production-level deployments for Visual Question Answering (VQA) tasks are still built as processing pipelines of independent steps, including image pre-processing, object and text detection, Optical Character Recognition (OCR), and (mostly supervised) object classification. However, recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question of whether these custom-trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs in the context of VQA and OCR [5, 9, 12] tasks in a production-level scenario. Using data from the Retail-786k [10] dataset, we investigate the capabilities of pre-trained VLMs to answer detailed questions about advertised products in images. Our study includes two commercial models, GPT-4V [16] and GPT-4o [17], as well as four open-source models: InternVL [5], LLaVA 1.5 [12], LLaVA-NeXT [13], and CogAgent [9]. Our initial results show that there is in general no large performance gap between open-source and commercial models. However, we observe a strong task-dependent variance in VLM performance: while most models can answer questions about the product brand and price with high accuracy, they completely fail to correctly identify the specific product name or discount. This indicates that VLMs struggle both with fine-grained classification tasks and with modeling the more abstract concept of a discount.
URL
https://arxiv.org/abs/2408.15626