
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

2024-03-07 13:16:24
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai

Abstract

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis. Our approach introduces enhancements along several dimensions. By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training. We hypothesize that images may contain redundant tokens; by using similarity to filter for the significant ones, we not only streamline the token length but also improve the model's performance. Moreover, by expanding the model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability and reduce hallucinations. Additionally, TextMonkey can be fine-tuned to understand commands for clicking on screenshots. Overall, our method notably boosts performance across various benchmark datasets, achieving gains of 5.2%, 6.9%, and 2.8% on Scene Text-Centric VQA, Document-Oriented VQA, and KIE, respectively, and a score of 561 on OCRBench, surpassing prior open-source large multimodal models for document understanding. Code will be released at this https URL.
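A minimal PyTorch sketch may make the first mechanism concrete. One plausible reading of "Shifted Window Attention with zero-initialization" is a Swin-style windowed attention block whose residual branch is gated by a zero-initialized scalar, so the pretrained vision features pass through unchanged at the start of training. The module name, the scalar gate, and the assumption that the window size evenly divides the feature grid are ours, not details confirmed by the paper.

import torch
import torch.nn as nn

class ZeroInitShiftedWindowAttention(nn.Module):
    # Hypothetical adapter: window attention layered on top of frozen
    # vision-encoder features; assumes H and W are divisible by `window`.
    def __init__(self, dim: int, num_heads: int, window: int, shift: int = 0):
        super().__init__()
        self.window, self.shift = window, shift
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: the block contributes nothing at step 0,
        # which is one way to read "stabilize early training".
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) patch features
        B, H, W, C = x.shape
        w = self.window
        if self.shift:  # cyclic shift yields cross-window connectivity
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        # partition into non-overlapping (w x w) windows
        wins = (x.view(B, H // w, w, W // w, w, C)
                 .permute(0, 1, 3, 2, 4, 5)
                 .reshape(-1, w * w, C))
        out, _ = self.attn(wins, wins, wins)  # self-attention inside windows
        out = (out.view(B, H // w, W // w, w, w, C)
                  .permute(0, 1, 3, 2, 4, 5)
                  .reshape(B, H, W, C))
        if self.shift:  # undo the cyclic shift
            out = torch.roll(out, shifts=(self.shift, self.shift), dims=(1, 2))
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        return x + self.gate * out  # zero-gated residual

Because the gate starts at zero, inserting such blocks into a pretrained encoder leaves its outputs untouched until gradients open the gate, which is the property the abstract credits for stable early training at higher resolutions.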

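The token-filtering idea, using similarity to decide which image tokens are significant, admits a similarly small sketch. Below, tokens are ranked by their mean cosine similarity to the rest of the sequence, the k least redundant are kept, and a cross-attention step lets them re-aggregate the full sequence. The ranking criterion, the function names, and the resampler design are a hedged reading of the abstract, not the released TextMonkey code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def filter_significant_tokens(tokens: torch.Tensor, k: int) -> torch.Tensor:
    # tokens: (B, N, C) image tokens; keep the k least redundant ones.
    z = F.normalize(tokens, dim=-1)
    sim = z @ z.transpose(1, 2)        # (B, N, N) pairwise cosine similarity
    redundancy = sim.mean(dim=-1)      # high mean similarity = redundant token
    idx = redundancy.topk(k, largest=False).indices  # (B, k)
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

class TokenResampler(nn.Module):
    # Hypothetical resampler: the kept tokens query the full sequence, so
    # the remaining tokens are compressed into them rather than discarded.
    def __init__(self, dim: int, num_heads: int, k: int):
        super().__init__()
        self.k = k
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        queries = filter_significant_tokens(tokens, self.k)
        out, _ = self.cross_attn(queries, tokens, tokens)
        return out  # (B, k, C): shorter sequence for the language model

Shrinking N tokens to k this way both cuts the language model's sequence length and, per the abstract, can improve accuracy, presumably because redundant visual tokens add noise rather than signal.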

URL

https://arxiv.org/abs/2403.04473

PDF

https://arxiv.org/pdf/2403.04473.pdf

