Abstract
Table Visual Question Answering (Table VQA) is typically addressed by large vision-language models (VLMs). While such models can answer directly from images, they often miss fine-grained details unless scaled to very large sizes, which are computationally prohibitive, especially for mobile deployment. A lighter alternative is to have a small VLM perform OCR and then use a large language model (LLM) to reason over structured outputs such as Markdown tables. However, these representations are not naturally optimized for LLMs and still introduce substantial errors. We propose TALENT (Table VQA via Augmented Language-Enhanced Natural-text Transcription), a lightweight framework that leverages dual representations of tables. TALENT prompts a small VLM to produce both OCR text and natural language narration, then combines them with the question for reasoning by an LLM. This reframes Table VQA as an LLM-centric multimodal reasoning task, where the VLM serves as a perception-narration module rather than a monolithic solver. Additionally, we construct ReTabVQA, a more challenging Table VQA dataset requiring multi-step quantitative reasoning over table images. Experiments show that TALENT enables a small VLM-LLM combination to match or surpass a single large VLM at significantly lower computational cost on both public datasets and ReTabVQA.
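The pipeline described above can be sketched in a few lines. This is a minimal illustration of the two-stage design, assuming generic callable model interfaces; the function names (`talent_answer`, `vlm_generate`, `llm_generate`) and the prompt wording are hypothetical, not the paper's actual implementation.

```python
def talent_answer(table_image, question, vlm_generate, llm_generate):
    """Answer a Table VQA question using dual table representations.

    vlm_generate(image, prompt) -> str : a small vision-language model.
    llm_generate(prompt) -> str        : a text-only large language model.
    """
    # 1. The small VLM transcribes the table as structured OCR-style text.
    ocr_text = vlm_generate(table_image, "Transcribe this table as text.")
    # 2. The same VLM narrates the table's contents in natural language.
    narration = vlm_generate(
        table_image, "Describe this table in natural language."
    )
    # 3. A text-only LLM reasons over both representations plus the question,
    #    so the VLM acts as a perception-narration module, not the solver.
    prompt = (
        f"Table (OCR transcription):\n{ocr_text}\n\n"
        f"Table (natural-language narration):\n{narration}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_generate(prompt)
```

The key design choice visible here is that the LLM never sees the image: all visual information reaches it only through the two complementary text representations, which is what makes a small VLM sufficient.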
URL
https://arxiv.org/abs/2510.07098