Paper Reading AI Learner

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

2024-04-25 14:23:24
Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang

Abstract

Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) it reduces the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) it reduces the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges the most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves state-of-the-art (SOTA) performance on a variety of chart understanding benchmarks, including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLMs with up to 13B parameters, such as ChartLlama and ChartAst, as well as the closed-source general-purpose MLLM GPT-4V on ChartQA. It also demonstrates superior efficiency, with higher throughput during inference thanks to its smaller model scale and more efficient vision encoding. Our code and model are available at this https URL.
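
To make the PoT idea concrete, here is a minimal sketch of the kind of program such a model might emit for a numerical chart question. The question, variable names, and chart values are all hypothetical illustrations, not outputs from the paper; TinyChart's actual answer templates may differ.

```python
# Question (hypothetical): "How many more units were sold in 2021 than in 2020?"
#
# Instead of predicting the final number directly, a PoT-trained model emits
# a short Python program; executing the program yields the answer.
sales_2020 = 120   # value read off the chart (hypothetical)
sales_2021 = 151   # value read off the chart (hypothetical)
answer = sales_2021 - sales_2020
print(answer)      # running this program prints 31
```

Offloading the arithmetic to an interpreter is what lets a small 3B model stay accurate on numerical questions: it only has to read values and compose operations, not perform the computation in its weights.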
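The token-merging idea can likewise be sketched in a few lines. The snippet below is a simplified ToMe-style bipartite matching step, written as an assumption about the general technique rather than the paper's exact module: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second, and the `r` highest-similarity pairs are averaged together, shrinking the sequence by `r` tokens.

```python
import torch

def merge_most_similar(x: torch.Tensor, r: int) -> torch.Tensor:
    """One simplified token-merging step: reduce (N, D) tokens to (N - r, D)
    by averaging the r most similar cross-set pairs."""
    a, b = x[::2], x[1::2]                    # alternating token sets A and B
    a_n = a / a.norm(dim=-1, keepdim=True)    # unit-normalize for cosine sim
    b_n = b / b.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.T                      # (|A|, |B|) cosine similarities
    best_val, best_idx = scores.max(dim=-1)   # best B-match for each A-token
    order = best_val.argsort(descending=True)
    merged, kept = order[:r], order[r:]       # A-tokens to merge vs. keep
    b = b.clone()
    # Average each merged A-token into its matched B-token. (For clarity this
    # sketch ignores the rare case of two A-tokens picking the same B-token.)
    b[best_idx[merged]] = (b[best_idx[merged]] + a[merged]) / 2
    return torch.cat([a[kept], b], dim=0)

tokens = torch.randn(576, 768)                # e.g. a 24x24 ViT patch grid
print(merge_most_similar(tokens, r=64).shape) # torch.Size([512, 768])
```

Applied gradually inside the vision transformer, this kind of merging keeps high-resolution chart images from inflating the token sequence fed to the language model, which is where the abstract's throughput gains come from.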


URL

https://arxiv.org/abs/2404.16635

PDF

https://arxiv.org/pdf/2404.16635.pdf

