Abstract
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis. Our approach introduces enhancements across several dimensions. By adopting Shifted Window Attention with zero initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training. We hypothesize that images may contain redundant tokens; by using similarity to filter them out and retain the significant ones, we not only shorten the token sequence but also improve the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability and reduce hallucinations. Additionally, TextMonkey can be fine-tuned to comprehend commands for clicking on screenshots. Overall, our method notably boosts performance across various benchmark datasets, achieving gains of 5.2%, 6.9%, and 2.8% on Scene Text-Centric VQA, Document-Oriented VQA, and KIE, respectively, and a score of 561 on OCRBench, surpassing prior open-source large multimodal models for document understanding. Code will be released at this https URL.
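The similarity-based token filtering described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes cosine similarity between token features and a fixed keep ratio, and treats a token whose nearest neighbour is very similar as redundant, keeping the most distinctive tokens instead. All names (`filter_significant_tokens`, `keep_ratio`) are illustrative.

```python
import numpy as np

def filter_significant_tokens(tokens: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the most distinctive tokens by cosine similarity.

    tokens: (N, D) array of token features.
    A token whose maximum cosine similarity to the *other* tokens is high
    is considered redundant; we keep the tokens whose maximum similarity
    is lowest, i.e. the most distinctive ones. Illustrative sketch only.
    """
    # Normalize rows so the dot product gives cosine similarity.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T                      # (N, N) pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)               # ignore self-similarity
    redundancy = sim.max(axis=1)                 # closeness to the nearest other token
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(redundancy)[:n_keep]   # lowest redundancy = most distinctive
    return np.sort(keep_idx)                     # preserve original token order
```

With two near-duplicate tokens and two distinct ones, a 50% keep ratio drops one of the duplicates and retains the distinctive pair, which is the streamlining effect the abstract claims.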
URL
https://arxiv.org/abs/2403.04473