Abstract
Multi-modal models have recently shown appealing performance on visual tasks, as instruction-guided training has evoked the ability to understand fine-grained visual content. However, current methods cannot be trivially applied to scene text recognition (STR) due to the gap between natural and text images. In this paper, we introduce a novel paradigm that formulates STR as an instruction learning problem, and propose instruction-guided scene text recognition (IGTR) to achieve effective cross-modal learning. IGTR first generates rich and diverse instruction triplets of <condition,question,answer>, serving as guidance for nuanced text image understanding. Then, we devise an architecture with a dedicated cross-modal feature fusion module and a multi-task answer head to effectively fuse the instruction and image features required for answering questions. Built upon these designs, IGTR facilitates accurate text recognition by comprehending character attributes. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins. Furthermore, by adjusting the instructions, IGTR enables various recognition schemes. These include zero-shot prediction, where the model is trained on instructions not explicitly targeting character recognition, and the recognition of rarely appearing and morphologically similar characters, which have previously challenged existing models.
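To make the <condition,question,answer> triplet concrete, the following is a minimal sketch of how character-attribute instructions might be generated from a ground-truth text label. The field names, question templates, and generation rules here are illustrative assumptions, not the paper's actual instruction set.

```python
from dataclasses import dataclass

@dataclass
class InstructionTriplet:
    # Hypothetical structure; field names are illustrative, not from the paper.
    condition: str
    question: str
    answer: str

def make_triplets(text: str) -> list:
    """Sketch: derive character-attribute instructions from a text label."""
    triplets = []
    for i, ch in enumerate(text):
        # Ask which character occupies a given position, conditioned on length.
        triplets.append(InstructionTriplet(
            condition=f"the image contains {len(text)} characters",
            question=f"what is the character at position {i}?",
            answer=ch,
        ))
        # Ask where a given character first occurs (no condition).
        triplets.append(InstructionTriplet(
            condition="",
            question=f"at which position does '{ch}' first appear?",
            answer=str(text.index(ch)),
        ))
    return triplets

triplets = make_triplets("taxi")
print(len(triplets))  # 8: two instructions per character of "taxi"
```

Training on many such question styles is what would let a model answer questions it was never explicitly optimized for, which is the intuition behind the zero-shot prediction scheme described above.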
URL
https://arxiv.org/abs/2401.17851