Abstract
Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts. State-of-the-art methods commonly use specialized pre-processing tools, such as optical character recognition (OCR) systems, that map document image inputs to extracted information in the space of textual tokens, and sometimes also employ large language models (LLMs) to reason in text token space. However, the gains from external tools and LLMs come at the cost of increased computational and engineering complexity. In this paper, we ask whether small pretrained image-to-text models can learn selective text or layout recognition and reasoning as an intermediate inference step in an end-to-end model for pixel-level visual language understanding. We incorporate the outputs of such OCR tools, LLMs, and larger multimodal models as intermediate "rationales" on training data, and train a small student model to predict both rationales and answers for input questions based on those training examples. A student model based on Pix2Struct (282M parameters) achieves consistent improvements on three visual document understanding benchmarks representing infographics, scanned documents, and figures, with improvements of more than 4% absolute over a comparable Pix2Struct model that predicts answers directly.
URL
https://arxiv.org/abs/2311.09612