TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription

2025-10-08 14:56:42
Guo Yutong, Wanying Wang, Yue Wu, Zichen Miao, Haoyu Wang

Abstract

Table Visual Question Answering (Table VQA) is typically addressed by large vision-language models (VLMs). While such models can answer directly from images, they often miss fine-grained details unless scaled to very large sizes, which are computationally prohibitive, especially for mobile deployment. A lighter alternative is to have a small VLM perform OCR and then use a large language model (LLM) to reason over structured outputs such as Markdown tables. However, these representations are not naturally optimized for LLMs and still introduce substantial errors. We propose TALENT (Table VQA via Augmented Language-Enhanced Natural-text Transcription), a lightweight framework that leverages dual representations of tables. TALENT prompts a small VLM to produce both OCR text and natural language narration, then combines them with the question for reasoning by an LLM. This reframes Table VQA as an LLM-centric multimodal reasoning task, where the VLM serves as a perception-narration module rather than a monolithic solver. Additionally, we construct ReTabVQA, a more challenging Table VQA dataset requiring multi-step quantitative reasoning over table images. Experiments show that TALENT enables a small VLM-LLM combination to match or surpass a single large VLM at significantly lower computational cost on both public datasets and ReTabVQA.
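
The abstract describes a two-stage pipeline: a small VLM first transcribes and narrates the table image, and an LLM then reasons over both outputs together with the question. The following Python sketch only illustrates that flow; the prompt wording and the vlm_generate/llm_generate helpers are hypothetical placeholders, not the paper's actual interface.

# Hypothetical sketch of the TALENT pipeline as described in the abstract.
# `vlm_generate(image, prompt)` and `llm_generate(prompt)` stand in for a
# small VLM and an LLM; both names and prompts are assumptions.

OCR_PROMPT = "Transcribe this table image as a Markdown table."
NARRATE_PROMPT = "Describe the contents of this table in plain natural language."

def answer_table_question(image, question, vlm_generate, llm_generate):
    # Stage 1: the small VLM acts as a perception-narration module,
    # producing two complementary views of the same table.
    ocr_text = vlm_generate(image, OCR_PROMPT)       # structured transcription
    narration = vlm_generate(image, NARRATE_PROMPT)  # natural-language narration

    # Stage 2: the LLM reasons over both representations plus the question.
    reasoning_prompt = (
        "You are given two descriptions of the same table.\n\n"
        f"Markdown transcription:\n{ocr_text}\n\n"
        f"Natural-language narration:\n{narration}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_generate(reasoning_prompt)

Passing the models in as callables keeps the sketch independent of any particular VLM/LLM backend; the essential point is only that the LLM receives the structured transcription and the narration side by side, rather than the OCR output alone.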

URL

https://arxiv.org/abs/2510.07098

PDF

https://arxiv.org/pdf/2510.07098.pdf

