Abstract
Advances in Large Language Models (LLMs) have inspired a surge of research exploring their expansion into the visual domain. While recent models exhibit promise in generating abstract captions for images and conducting natural conversations, their performance on text-rich images leaves room for improvement. In this paper, we propose the Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details typically overlooked by existing methods. Cream integrates vision and auxiliary encoders, complemented by a contrastive feature alignment technique, resulting in a more effective understanding of textual information within document images. Our approach thus seeks to bridge the gap between vision and language understanding, paving the way for more sophisticated Document Intelligence Assistants. Rigorous evaluations across diverse tasks, such as visual question answering on document images, demonstrate the efficacy of Cream as a state-of-the-art model in the field of visual document understanding. We provide our codebase and newly generated datasets at this https URL
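To make the "contrastive feature alignment" idea concrete, here is a minimal sketch of one common formulation: an InfoNCE-style loss that pulls paired features from a vision encoder and an auxiliary (e.g., OCR-text) encoder together. The function name, dimensions, and symmetric cross-entropy formulation are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch: CLIP-style contrastive alignment between pooled features
# from a vision encoder and an auxiliary encoder. Illustrative only; the
# actual Cream objective may differ.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(vision_feats, aux_feats, temperature=0.07):
    """InfoNCE loss pulling paired vision/auxiliary features together.

    vision_feats: (B, D) pooled features from the vision encoder.
    aux_feats:    (B, D) pooled features from the auxiliary encoder.
    """
    v = F.normalize(vision_feats, dim=-1)
    a = F.normalize(aux_feats, dim=-1)
    logits = v @ a.t() / temperature  # (B, B) pairwise similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
    # Symmetric cross-entropy: match images to their own auxiliary features
    # and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage with random stand-in features:
B, D = 8, 256
loss = contrastive_alignment_loss(torch.randn(B, D), torch.randn(B, D))
```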
URL
https://arxiv.org/abs/2305.15080