Abstract
While existing generation and unified models excel at general image generation, they struggle with tasks that require deep reasoning, planning, and precise data-to-visual mapping beyond general scenarios. To push beyond these limitations, we introduce a new and challenging task, creative table visualization, which requires a model to generate an infographic that faithfully and aesthetically visualizes the data in a given table. To address this challenge, we propose ShowTable, a pipeline that combines multimodal large language models (MLLMs) with diffusion models through a progressive self-correcting process. The MLLM acts as the central orchestrator: it reasons about the visual plan and judges visual errors to provide refined instructions, while the diffusion model executes the MLLM's commands, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training the different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multimodal reasoning, generation, and error correction capabilities.
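The progressive self-correcting process described above can be pictured as a plan, generate, judge, edit loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `Judgment`, `self_correcting_loop`, `plan_fn`, `generate_fn`, `judge_fn`, `edit_fn`, and `max_rounds` are all hypothetical, and the MLLM and diffusion model are represented by plain callables that the caller supplies.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Judgment:
    """Hypothetical verdict returned by the MLLM judge."""
    ok: bool               # True when no visual errors were found
    instruction: str = ""  # refined edit instruction when ok is False


def self_correcting_loop(
    table: Any,
    plan_fn: Callable[[Any], Any],           # assumed: MLLM turns the table into a visual plan
    generate_fn: Callable[[Any], Any],       # assumed: diffusion model renders the plan
    judge_fn: Callable[[Any, Any], Judgment],  # assumed: MLLM checks the image against the table
    edit_fn: Callable[[Any, str], Any],      # assumed: diffusion model applies an edit instruction
    max_rounds: int = 3,                     # assumed cap on correction rounds
) -> Tuple[Any, List[Any]]:
    """Run plan -> generate -> (judge -> edit)* and return the final image and all drafts."""
    plan = plan_fn(table)
    image = generate_fn(plan)
    history = [image]
    for _ in range(max_rounds):
        verdict = judge_fn(table, image)
        if verdict.ok:
            break  # the judge found nothing left to fix
        image = edit_fn(image, verdict.instruction)
        history.append(image)
    return image, history
```

Passing the models in as callables keeps the loop independent of any particular MLLM or diffusion backend, which loosely mirrors the abstract's remark that the pipeline can be instantiated with different models.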
URL
https://arxiv.org/abs/2512.13303