Abstract
In this paper, we present StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It comprises two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking: the method randomly masks image regions according to the bounding box coordinates of text words. The pre-training objectives are to simultaneously reconstruct the pixels of the masked image regions and the corresponding masked text tokens. As a result, the pre-trained encoder captures richer textual semantics than masked image modeling approaches that only predict masked image patches. Unlike masked multi-modal modeling methods for document image understanding, which rely on both the image and text modalities, StrucTexTv2 takes image-only input and can therefore handle more application scenarios without OCR pre-processing. Extensive experiments on mainstream document image understanding benchmarks demonstrate the effectiveness of StrucTexTv2: it achieves competitive or even new state-of-the-art performance on downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.
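The core idea of text region-level image masking can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes a NumPy image array and a list of word bounding boxes (the helper name `mask_text_regions` and the zero-fill masking strategy are our own assumptions for illustration).

```python
import numpy as np

def mask_text_regions(image, word_boxes, mask_ratio=0.3, rng=None):
    """Illustrative sketch (hypothetical helper): randomly select a fraction
    of word bounding boxes and erase the corresponding image pixels, so a
    model can be trained to reconstruct both the pixels and the text tokens
    of the masked regions."""
    rng = rng or np.random.default_rng()
    masked = image.copy()
    # number of word regions to mask (at least one)
    n_mask = max(1, int(len(word_boxes) * mask_ratio))
    chosen = rng.choice(len(word_boxes), size=n_mask, replace=False)
    mask_flags = np.zeros(len(word_boxes), dtype=bool)
    mask_flags[chosen] = True
    for i in chosen:
        x0, y0, x1, y1 = word_boxes[i]
        masked[y0:y1, x0:x1] = 0  # zero out the pixels of this word region
    return masked, mask_flags

# toy usage: a white 64x128 "document" with three word boxes
img = np.full((64, 128, 3), 255, dtype=np.uint8)
boxes = [(2, 2, 30, 12), (40, 2, 70, 12), (2, 20, 50, 30)]
masked_img, flags = mask_text_regions(img, boxes, mask_ratio=0.5)
```

The returned `mask_flags` indicate which word regions were masked; in the paper's setting, the pixel reconstruction loss and the masked language modeling loss would both be computed over exactly these regions.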
URL
https://arxiv.org/abs/2303.00289