Paper Reading AI Learner

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

2024-04-30 02:05:18
Yoonsik Kim, Moonbin Yim, Ka Yeon Song

Abstract

In this paper, we establish a benchmark for table visual question answering, referred to as TableVQA-Bench, derived from pre-existing table question-answering (QA) and table structure recognition datasets. It is important to note that existing datasets have not incorporated images or QA pairs, which are two crucial components of TableVQA. As such, the primary objective of this paper is to obtain these necessary components. Specifically, images are sourced either by applying a stylesheet or by employing the proposed table rendering system. QA pairs are generated by exploiting a large language model (LLM) whose input is a text-formatted table. Ultimately, the completed TableVQA-Bench comprises 1,500 QA pairs. We comprehensively compare the performance of various multi-modal large language models (MLLMs) on TableVQA-Bench. In our experiments, GPT-4V achieves the highest accuracy among both commercial and open-sourced MLLMs. Moreover, we discover that the number of vision queries plays a significant role in TableVQA performance. To further analyze the capabilities of MLLMs relative to their LLM backbones, we evaluate MLLMs on image-formatted tables and the corresponding LLMs on text-formatted tables. Our findings suggest that processing visual inputs is more challenging than processing text inputs, as evidenced by the lower performance of MLLMs despite their generally higher computational costs. The proposed TableVQA-Bench and evaluation codes are available at this https URL.
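The abstract describes generating QA pairs by prompting an LLM with a text-formatted table. A minimal sketch of what such a prompt-construction step might look like is shown below; the `table_to_text` and `generate_qa_prompt` helpers, the prompt wording, and the example table are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of the QA-pair generation step: a table is
# flattened into a text format and embedded in a prompt asking an LLM
# to produce question-answer pairs grounded only in the table.
# (Assumed helpers; not the TableVQA-Bench implementation.)

def table_to_text(rows):
    """Flatten a table (list of rows) into a simple pipe-delimited text form."""
    return "\n".join(" | ".join(str(cell) for cell in row) for row in rows)

def generate_qa_prompt(rows, num_pairs=3):
    """Build an LLM prompt requesting QA pairs answerable from the table alone."""
    table_text = table_to_text(rows)
    return (
        f"Given the following table, write {num_pairs} question-answer "
        "pairs that can be answered using only the table.\n\n"
        f"Table:\n{table_text}\n\n"
        "Format each pair as 'Q: ... A: ...'."
    )

# Toy example table (hypothetical content for illustration only).
rows = [
    ["Model", "Type"],
    ["GPT-4V", "commercial"],
    ["Open-source MLLM", "open"],
]
prompt = generate_qa_prompt(rows, num_pairs=2)
print(prompt)
```

The resulting prompt string would then be sent to an LLM; parsing the model's `Q: ... A: ...` output back into structured pairs is a separate step not shown here.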


URL

https://arxiv.org/abs/2404.19205

PDF

https://arxiv.org/pdf/2404.19205.pdf

