Paper Reading AI Learner

Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail

2024-08-28 08:25:41
Bianca Lamm, Janis Keuper

Abstract

Most production-level deployments for Visual Question Answering (VQA) tasks are still built as processing pipelines of independent steps, including image pre-processing, object and text detection, Optical Character Recognition (OCR), and (mostly supervised) object classification. However, recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question of whether these custom-trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs on VQA and OCR [5, 9, 12] tasks in a production-level scenario. Using data from the Retail-786k [10] dataset, we investigate the capabilities of pre-trained VLMs to answer detailed questions about advertised products in images. Our study includes two commercial models, GPT-4V [16] and GPT-4o [17], as well as four open-source models: InternVL [5], LLaVA 1.5 [12], LLaVA-NeXT [13], and CogAgent [9]. Our initial results show that, in general, there is no large performance gap between open-source and commercial models. However, we observe a strong task-dependent variance in VLM performance: while most models answer questions about the product brand and price with high accuracy, they completely fail to correctly identify the specific product name or discount. This points to the difficulty VLMs have with fine-grained classification tasks, as well as with modeling the more abstract concept of a discount.
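The single-step VLM approach contrasted with the OCR pipeline above amounts to sending a product image and a question directly to a multimodal chat model. As a hedged illustration (not code from the paper), the sketch below builds the kind of request payload the OpenAI Python SDK's chat-completions API expects for a vision model such as GPT-4o; the image URL and question list are hypothetical placeholders.

```python
# Sketch of single-step VQA with a VLM (illustrative, not from the paper).
# Assumes the OpenAI chat-completions multimodal message format; the image
# URL and questions below are hypothetical placeholders.

PRODUCT_QUESTIONS = [
    "What is the brand of the advertised product?",
    "What is the price of the advertised product?",
    "What is the exact product name?",
    "What discount is advertised?",
]

def build_vqa_messages(image_url: str, question: str) -> list[dict]:
    """Build a single-turn multimodal message: one text question plus one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

# With an API key configured, the call itself would look like:
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(
#       model="gpt-4o",
#       messages=build_vqa_messages(url, PRODUCT_QUESTIONS[0]),
#   )
#   answer = resp.choices[0].message.content
```

The key point of the paper's comparison is that this one call replaces the entire detection-OCR-classification pipeline, with per-question accuracy varying strongly (brand and price work well; product name and discount do not).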

URL

https://arxiv.org/abs/2408.15626

PDF

https://arxiv.org/pdf/2408.15626.pdf

