Abstract
Long document question answering (DocQA) aims to answer questions over long documents of more than 10k words. Such documents usually have content structure, such as sections, sub-sections, and paragraph demarcations. However, indexing methods for long documents remain under-explored, and existing systems generally employ fixed-length chunking. Because fixed-length chunking ignores content structure, the resulting chunks can exclude vital information or include irrelevant content. Motivated by this, we propose Multi-view Content-aware indexing (MC-indexing) for more effective long DocQA by (i) segmenting a structured document into content chunks and (ii) representing each content chunk in raw-text, keyword, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning. Being plug-and-play, it can be seamlessly integrated with any retriever to boost its performance. In addition, we propose a long DocQA dataset that includes not only question-answer pairs but also document structure and answer scope. Compared with state-of-the-art chunking schemes, MC-indexing significantly increases recall by 42.8%, 30.0%, 23.9%, and 16.3% at top k = 1.5, 3, 5, and 10, respectively. These improvements are averaged over 8 widely used retrievers (2 sparse and 6 dense) in extensive experiments.
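The two steps named in the abstract, content-aware segmentation and multi-view representation, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the document marks sections with Markdown-style `#` headings, uses a word-frequency keyword picker and a first-sentence summary as simple stand-ins for whatever keyword and summary generators are actually used, and scores chunks by word overlap in place of a real sparse or dense retriever. Taking the best score across the three views is also an assumption of this sketch, and all names (`ContentChunk`, `build_index`, `retrieve`) are hypothetical.

```python
from __future__ import annotations

import re
from collections import Counter
from dataclasses import dataclass

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}


@dataclass
class ContentChunk:
    title: str
    raw_text: str        # view 1: the section text itself
    keywords: list[str]  # view 2: salient terms
    summary: str         # view 3: short description


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def segment_by_headings(document: str) -> list[tuple[str, str]]:
    """Split at '#' headings so each chunk follows a section boundary."""
    sections, title, body = [], "", []
    for line in document.splitlines():
        match = re.match(r"^#+\s+(.*)$", line)
        if match:
            if title or any(part.strip() for part in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    if title or any(part.strip() for part in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


def toy_keywords(text: str, top_n: int = 5) -> list[str]:
    """Stand-in keyword view: most frequent non-stopword terms."""
    counts = Counter(w for w in tokenize(text) if len(w) > 2 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def toy_summary(text: str) -> str:
    """Stand-in summary view: the first sentence of the section."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def build_index(document: str) -> list[ContentChunk]:
    return [
        ContentChunk(title, body, toy_keywords(body), toy_summary(body))
        for title, body in segment_by_headings(document)
    ]


def retrieve(index: list[ContentChunk], question: str, top_k: int = 1) -> list[ContentChunk]:
    """Rank chunks by their best word-overlap score across the three views."""
    query = set(tokenize(question))

    def score(chunk: ContentChunk) -> int:
        views = [chunk.raw_text, " ".join(chunk.keywords), chunk.summary]
        return max(len(query & set(tokenize(view))) for view in views)

    return sorted(index, key=score, reverse=True)[:top_k]


if __name__ == "__main__":
    doc = (
        "# Intro\nCats sleep a lot. They like warm places.\n"
        "# Methods\nWe measure retrieval recall. Recall improves with chunking."
    )
    index = build_index(doc)
    for chunk in retrieve(index, "How is recall measured?", top_k=1):
        print(chunk.title, chunk.keywords, chunk.summary)
```

Because the index is built from the text alone, nothing here requires training, and the overlap scorer in `retrieve` could in principle be swapped for any retriever's similarity function without changing how the chunks and views are built.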
URL
https://arxiv.org/abs/2404.15103