Paper Reading AI Learner

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

2024-04-25 14:23:24
Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang

Abstract

Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) it reduces the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) it reduces the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges the most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves state-of-the-art (SOTA) performance on a variety of chart understanding benchmarks, including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLMs with up to 13B parameters, such as ChartLlama and ChartAst, as well as the closed-source general-purpose MLLM GPT-4V on ChartQA. It also demonstrates superior efficiency, with higher throughput during inference thanks to its smaller model scale and more efficient vision encoding. Our code and model are available at this https URL.
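
To make the PoT idea concrete, here is a minimal sketch of the kind of program such a model might emit for a numerical chart question. The question, variable names, and chart values are all hypothetical illustrations, not outputs from the paper; TinyChart's actual answer templates may differ.

```python
# Question (hypothetical): "How many more units were sold in 2021 than in 2020?"
#
# Instead of predicting the final number directly, a PoT-trained model emits
# a short Python program; executing the program yields the answer.
sales_2020 = 120   # value read off the chart (hypothetical)
sales_2021 = 151   # value read off the chart (hypothetical)
answer = sales_2021 - sales_2020
print(answer)      # running this program prints 31
```

Offloading the arithmetic to an interpreter is what lets a small 3B model stay accurate on numerical questions: it only has to read values and compose operations, not perform the computation in its weights.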
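The token-merging idea can likewise be sketched in a few lines. The snippet below is a simplified ToMe-style bipartite matching step, written as an assumption about the general technique rather than the paper's exact module: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second, and the `r` highest-similarity pairs are averaged together, shrinking the sequence by `r` tokens.

```python
import torch

def merge_most_similar(x: torch.Tensor, r: int) -> torch.Tensor:
    """One simplified token-merging step: reduce (N, D) tokens to (N - r, D)
    by averaging the r most similar cross-set pairs."""
    a, b = x[::2], x[1::2]                    # alternating token sets A and B
    a_n = a / a.norm(dim=-1, keepdim=True)    # unit-normalize for cosine sim
    b_n = b / b.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.T                      # (|A|, |B|) cosine similarities
    best_val, best_idx = scores.max(dim=-1)   # best B-match for each A-token
    order = best_val.argsort(descending=True)
    merged, kept = order[:r], order[r:]       # A-tokens to merge vs. keep
    b = b.clone()
    # Average each merged A-token into its matched B-token. (For clarity this
    # sketch ignores the rare case of two A-tokens picking the same B-token.)
    b[best_idx[merged]] = (b[best_idx[merged]] + a[merged]) / 2
    return torch.cat([a[kept], b], dim=0)

tokens = torch.randn(576, 768)                # e.g. a 24x24 ViT patch grid
print(merge_most_similar(tokens, r=64).shape) # torch.Size([512, 768])
```

Applied gradually inside the vision transformer, this kind of merging keeps high-resolution chart images from inflating the token sequence fed to the language model, which is where the abstract's throughput gains come from.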


URL

https://arxiv.org/abs/2404.16635

PDF

https://arxiv.org/pdf/2404.16635.pdf

