ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement

2025-12-15 13:21:50
Zhihang Liu, Xiaoyi Bao, Pandeng Li, Junjie Zhou, Zhaohe Liao, Yefei He, Kaixun Jiang, Chen-Wei Xie, Yun Zheng, Hongtao Xie

Abstract

While existing generation and unified models excel at general image generation, they struggle with tasks that require deep reasoning, planning, and precise data-to-visual mapping beyond general scenarios. To push past these limitations, we introduce a new and challenging task: creative table visualization, which requires the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator, reasoning about the visual plan and judging visual errors to provide refined instructions, while the diffusion model executes the MLLM's commands, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.
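The plan, generate, judge, refine loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: every name below (`Review`, `reflect_and_refine`, the `toy_*` functions) is hypothetical, the "image" is a plain dictionary standing in for pixels, and the stopping rule and round limit are assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Table = Dict[str, float]
Image = Dict[str, float]  # toy stand-in for pixels: the values actually drawn


@dataclass
class Review:
    ok: bool
    instruction: str = ""  # refined instruction; empty when nothing is wrong


def reflect_and_refine(
    table: Table,
    plan: Callable[[Table], str],            # MLLM role: reason out a visual plan
    generate: Callable[[str], Image],        # diffusion role: draw a first draft
    judge: Callable[[Table, Image], Review], # MLLM role: spot visual errors
    refine: Callable[[Image, str], Image],   # diffusion role: apply one correction
    max_rounds: int = 3,                     # assumed budget, not from the paper
) -> Tuple[Image, List[Image]]:
    image = generate(plan(table))
    history = [image]
    for _ in range(max_rounds):
        review = judge(table, image)
        if review.ok:
            break
        image = refine(image, review.instruction)
        history.append(image)
    return image, history


# Toy stand-ins so the loop can run without any model.
def toy_plan(table: Table) -> str:
    return ";".join(f"{label}={value}" for label, value in table.items())


def toy_generate(prompt: str) -> Image:
    image = {k: float(v) for k, v in (item.split("=") for item in prompt.split(";"))}
    image[list(image)[-1]] = 0.0  # simulated rendering error in the last entry
    return image


def toy_judge(table: Table, image: Image) -> Review:
    for label, value in table.items():
        if image.get(label) != value:
            return Review(ok=False, instruction=f"{label}={value}")
    return Review(ok=True)


def toy_refine(image: Image, instruction: str) -> Image:
    label, value = instruction.split("=")
    fixed = dict(image)
    fixed[label] = float(value)
    return fixed
```

Calling `reflect_and_refine({"Q1": 10.0, "Q2": 20.0, "Q3": 30.0}, toy_plan, toy_generate, toy_judge, toy_refine)` runs one correction round in this toy setting: the judge flags the wrong `Q3` value and the refiner repairs it. In a real system each callable would wrap a model call.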

URL

https://arxiv.org/abs/2512.13303

PDF

https://arxiv.org/pdf/2512.13303.pdf

