TextSquare: Scaling up Text-Centric Visual Instruction Tuning

2024-04-19 11:38:08
Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang

Abstract

Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini on six of ten text-centric benchmarks. 2) We demonstrate the critical role of VQA reasoning data in providing comprehensive contextual insight for specific questions: it not only improves accuracy but also significantly mitigates hallucination. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination-evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals a clear pattern: model performance improves in direct proportion to the exponential growth of instruction-tuning data volume (i.e., roughly linearly in the logarithm of the data volume), validating both the necessity of the dataset's scale and the high quality of Square-10M.
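Since the abstract describes the Square process only at a high level, here is a minimal Python sketch of the four-step generation loop. `mllm_generate` is a hypothetical placeholder for a call to a closed-source MLLM API, and the prompts and the yes/no evaluation filter are illustrative assumptions, not the paper's exact prompts or filtering criteria.

from dataclasses import dataclass
from typing import Optional


@dataclass
class SquareSample:
    """One instruction-tuning sample produced by the Square loop."""
    image_path: str
    question: str
    answer: str
    reasoning: str


def mllm_generate(image_path: str, prompt: str) -> str:
    """Hypothetical placeholder for a call to a closed-source MLLM API."""
    raise NotImplementedError


def square_pipeline(image_path: str) -> Optional[SquareSample]:
    # 1) Self-Questioning: the MLLM poses a text-centric question about the image.
    question = mllm_generate(
        image_path,
        "Ask one question about the text content of this image.")

    # 2) Answering: the MLLM answers its own question.
    answer = mllm_generate(
        image_path,
        f"Question: {question}\nAnswer concisely.")

    # 3) Reasoning: elicit the contextual evidence behind the answer.
    reasoning = mllm_generate(
        image_path,
        f"Question: {question}\nAnswer: {answer}\n"
        "Explain step by step why this answer is correct.")

    # 4) Evaluation: the MLLM verifies its own QA pair; failures are filtered out.
    verdict = mllm_generate(
        image_path,
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this answer correct and grounded in the image? Reply yes or no.")
    if not verdict.strip().lower().startswith("yes"):
        return None  # discarded by the evaluation step

    return SquareSample(image_path, question, answer, reasoning)

Applied at scale to a large image corpus, a self-filtering loop of this shape is what makes a ten-million-sample dataset such as Square-10M feasible without manual annotation.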

Abstract (translated)

Text-centric visual question answering (VQA) has made great progress with the development of Multimodal Large Language Models (MLLMs); however, open-source models still lag behind leading models such as GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new method for constructing a large-scale, high-quality instruction-tuning dataset, Square-10M, generated using closed-source MLLMs. The data construction process, called Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M yielded three key findings: 1) Our model, TextSquare, substantially surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%); it even surpasses the top-tier models GPT4V and Gemini on six of ten text-centric benchmarks. 2) We further show that VQA reasoning data is critical for providing comprehensive contextual insight into specific questions; it not only improves accuracy but also significantly reduces hallucination. Specifically, TextSquare achieves an average score of 75.1% on four general VQA and hallucination-evaluation datasets, exceeding the previous state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals a clear pattern: the exponential growth of instruction-tuning data volume is directly proportional to the improvement in model performance, which validates the necessity of the dataset's scale and the high quality of Square-10M.

URL

https://arxiv.org/abs/2404.12803

PDF

https://arxiv.org/pdf/2404.12803.pdf

