Abstract
Harnessing visual texts represents a burgeoning frontier in the evolution of language modeling. In this paper, we introduce a novel pre-training framework for a suite of pixel-based autoregressive language models, pre-trained on a corpus of over 400 million documents rendered as RGB images. Our approach is characterized by a dual-modality training regimen, engaging visual data through next patch prediction with a regression head and textual data through next token prediction with a classification head. This study focuses on investigating the synergistic interplay between the visual and textual modalities of language. Our comprehensive evaluation across a diverse array of benchmarks reveals that combining visual and textual data substantially improves the performance of pixel-based language models. Notably, our findings show that a unidirectional pixel-based model trained without any textual data can match the performance of advanced bidirectional pixel-based models on various language understanding benchmarks. This work highlights the considerable untapped potential of integrating visual and textual information for language modeling. We will release our code, data, and checkpoints to support further research.
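To make the dual-modality objective concrete, the following is a minimal sketch of a shared causal Transformer backbone with a regression head for next patch prediction and a classification head for next token prediction. All class names, dimensions, and the shared-backbone layout are illustrative assumptions and are not taken from the paper's released code.

```python
import torch
import torch.nn as nn

class DualModalityLM(nn.Module):
    """Hypothetical sketch: one autoregressive backbone, two output heads."""
    def __init__(self, d_model=512, n_layers=6, n_heads=8,
                 patch_dim=768, vocab_size=32000):
        super().__init__()
        # Shared causal Transformer backbone over both modalities.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Input projections: flattened RGB patches vs. text token ids.
        self.patch_in = nn.Linear(patch_dim, d_model)
        self.token_in = nn.Embedding(vocab_size, d_model)
        # Regression head (pixels) and classification head (tokens).
        self.patch_head = nn.Linear(d_model, patch_dim)
        self.token_head = nn.Linear(d_model, vocab_size)

    @staticmethod
    def _causal_mask(n, device):
        # Upper-triangular -inf mask enforces left-to-right attention.
        return torch.triu(torch.full((n, n), float('-inf'), device=device), diagonal=1)

    def pixel_loss(self, patches):
        # patches: (batch, seq, patch_dim); predict patch t+1 from patches <= t.
        h = self.backbone(self.patch_in(patches),
                          mask=self._causal_mask(patches.size(1), patches.device))
        pred = self.patch_head(h[:, :-1])
        return nn.functional.mse_loss(pred, patches[:, 1:])

    def text_loss(self, tokens):
        # tokens: (batch, seq); standard next-token prediction.
        h = self.backbone(self.token_in(tokens),
                          mask=self._causal_mask(tokens.size(1), tokens.device))
        logits = self.token_head(h[:, :-1])
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```

In this reading, the two losses share every backbone parameter and differ only in their input projections and output heads, which is one plausible way the visual and textual objectives could reinforce each other during pre-training.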
URL
https://arxiv.org/abs/2404.10710