Abstract
In recent years, text-image joint pre-training techniques have shown promising results on various tasks. In Optical Character Recognition (OCR) tasks, however, aligning text instances with their corresponding text regions in images poses a challenge: it requires effective alignment between text and OCR-Text (we refer to text appearing in images as OCR-Text, to distinguish it from text in natural language) rather than a holistic understanding of the overall image content. In this paper, we propose a new pre-training method called OCR-Text Destylization Modeling (ODM), which transfers the diverse styles of text found in images into a uniform style, guided by the text prompt. With ODM, we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles encountered in scene text detection and spotting tasks. Additionally, we design a new label generation method specifically for ODM and combine it with our proposed Text-Controller module to address the high annotation cost of OCR tasks, allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods on scene text detection and spotting tasks. Code is available at {this https URL}.
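The abstract describes a destylization objective: given an image containing styled text, the model is trained to reproduce the same text in a single uniform style. The paper's actual architecture and loss are not specified here, so the following is only a minimal toy sketch of what such a pixel-level destylization loss could look like; the function name, the L1 choice, and the binary "uniform-style glyph map" target are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def destylization_loss(pred_glyphs: np.ndarray, target_glyphs: np.ndarray) -> float:
    """Toy pixel-level L1 loss between a model's predicted destylized text
    image and a uniform-style rendering of the ground-truth text.
    (Hypothetical; the paper's real objective may differ.)"""
    return float(np.abs(pred_glyphs - target_glyphs).mean())

# Toy example: treat an 8x8 array as a tiny text image.
rng = np.random.default_rng(0)
pred = rng.random((8, 8))                 # stand-in for a model prediction
target = (pred > 0.5).astype(float)       # stand-in for a binary uniform-style glyph map
loss = destylization_loss(pred, target)
```

Under this sketch, pre-training would minimize the loss over many image/text pairs, pushing the encoder to represent text content independently of its visual style.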
URL
https://arxiv.org/abs/2403.00303