Multi-view Content-aware Indexing for Long Document Retrieval

Abstract

Long document question answering (DocQA) aims to answer questions from long documents of over 10k words. Such documents usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, indexing methods for long documents remain under-explored, and existing systems generally employ fixed-length chunking. Because fixed-length chunking does not consider content structure, the resulting chunks can exclude vital information or include irrelevant content. Motivated by this, we propose Multi-view Content-aware indexing (MC-indexing) for more effective long DocQA by (i) segmenting a structured document into content chunks, and (ii) representing each content chunk in raw-text, keywords, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning. Having plug-and-play capability, it can be seamlessly integrated with any retriever to boost its performance. In addition, we propose a long DocQA dataset that includes not only question-answer pairs, but also document structure and answer scope. Compared to state-of-the-art chunking schemes, MC-indexing significantly increases recall by 42.8%, 30.0%, 23.9%, and 16.3% at top k = 1.5, 3, 5, and 10, respectively. These improvements are averaged over 8 widely used retrievers (2 sparse and 6 dense) in extensive experiments.
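
The two steps named in the abstract (structure-based chunking, then three views per chunk) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the document is already split into (section title, text) pairs, and it swaps in deliberately simple stand-ins: frequency-based keywords, the first sentence as the summary, and a token-overlap scorer in place of a real sparse or dense retriever. All names (`Chunk`, `build_views`, `retrieve`) are hypothetical, and scoring a chunk by its best-matching view is an assumption made for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "for", "on", "with", "by", "it", "as", "how"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass
class Chunk:
    section: str   # section title taken from the document structure
    raw_text: str  # view 1: the chunk's original text
    keywords: str  # view 2: space-separated keywords
    summary: str   # view 3: short summary


def build_views(sections, n_keywords=5):
    """Turn (title, text) sections into chunks with three views each.

    Keywords are the most frequent non-stopword tokens and the summary is
    the first sentence; both are simple stand-ins chosen for this sketch.
    """
    chunks = []
    for title, text in sections:
        counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
        keywords = " ".join(word for word, _ in counts.most_common(n_keywords))
        summary = re.split(r"(?<=[.!?])\s+", text.strip())[0]
        chunks.append(Chunk(title, text, keywords, summary))
    return chunks


def overlap_score(query, text):
    """Count distinct non-stopword query tokens that appear in the text."""
    query_tokens = set(tokenize(query)) - STOPWORDS
    return len(query_tokens & set(tokenize(text)))


def retrieve(query, chunks, k=3):
    """Rank chunks by their best score across the three views; return top k."""
    def best_view_score(chunk):
        views = (chunk.raw_text, chunk.keywords, chunk.summary)
        return max(overlap_score(query, view) for view in views)

    return sorted(chunks, key=best_view_score, reverse=True)[:k]
```

Because the views are built without any training, the scorer in `retrieve` could in principle be replaced by any retriever that ranks text against a query, which is the plug-and-play property the abstract describes.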

URL

https://arxiv.org/abs/2404.15103

PDF

https://arxiv.org/pdf/2404.15103.pdf
