Abstract
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRDs), particularly those dominated by lengthy textual content, such as research journal articles. Existing studies primarily focus on real-world documents with sparse text, and challenges remain in comprehending the hierarchical semantic relations across multiple pages in order to locate multimodal components. To address this gap, we propose PDF-MVQA, a dataset tailored for research journal articles that encompasses multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers, or visually rich document entities such as tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, enabling the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual content and the relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling the challenges posed by text-dominant documents in VRD-QA.
URL
https://arxiv.org/abs/2404.12720