Multi-view Content-aware Indexing for Long Document Retrieval

2024-04-23 14:55:32
Kuicai Dong, Derrick Goh Xin Deik, Yi Quan Lee, Hao Zhang, Xiangyang Li, Cong Zhang, Yong Liu

Abstract

Long document question answering (DocQA) aims to answer questions from long documents of over 10k words. Such documents usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, indexing methods for long documents remain under-explored, and existing systems generally employ fixed-length chunking. Because fixed-length chunking does not consider content structure, the resulting chunks can exclude vital information or include irrelevant content. Motivated by this, we propose Multi-view Content-aware indexing (MC-indexing) for more effective long DocQA, which (i) segments a structured document into content chunks, and (ii) represents each content chunk in raw-text, keywords, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning. Having plug-and-play capability, it can be seamlessly integrated with any retriever to boost its performance. In addition, we propose a long DocQA dataset that includes not only question-answer pairs, but also document structure and answer scope. Compared to state-of-the-art chunking schemes, MC-indexing significantly increases recall by 42.8%, 30.0%, 23.9%, and 16.3% at top k = 1.5, 3, 5, and 10 respectively. These improvements are averaged over 8 widely used retrievers (2 sparse and 6 dense) in extensive experiments.
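
To make the two steps in the abstract concrete, the following is a minimal sketch of content-aware chunking followed by a three-view index. It is not the authors' implementation. Everything in it is an assumption made for illustration: sections are marked by lines starting with `#`, keywords are simply the most frequent non-stopword tokens, the summary is the first sentence of a section, and retrieval uses plain token overlap in place of a real sparse or dense retriever. The names `Chunk`, `split_by_headings`, `build_index`, and `retrieve` are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "are"}


def tokenize(text):
    """Lowercase, keep alphanumeric tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


@dataclass
class Chunk:
    title: str
    raw_text: str   # view 1: the section text itself
    keywords: str   # view 2: placeholder keyword view
    summary: str    # view 3: placeholder summary view


def split_by_headings(document):
    """Step (i): one chunk per section, so chunk boundaries follow the content structure."""
    sections, title, body = [], "preamble", []
    for line in document.splitlines():
        if line.startswith("#"):
            if body:
                sections.append((title, " ".join(body)))
            title, body = line.lstrip("# ").strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:
        sections.append((title, " ".join(body)))
    return sections


def build_index(document, n_keywords=5):
    """Step (ii): represent every chunk in raw-text, keywords, and summary views."""
    chunks = []
    for title, text in split_by_headings(document):
        counts = Counter(tokenize(text))
        keywords = " ".join(word for word, _ in counts.most_common(n_keywords))
        summary = re.split(r"(?<=[.!?])\s+", text)[0]  # first sentence as a stand-in summary
        chunks.append(Chunk(title, text, keywords, summary))
    return chunks


def overlap_score(query, text):
    """Stand-in relevance score: number of shared tokens."""
    return len(set(tokenize(query)) & set(tokenize(text)))


def retrieve(query, chunks, k=1):
    """Rank chunks in each view separately, then merge the per-view top-k lists without duplicates."""
    selected = []
    for view in ("raw_text", "keywords", "summary"):
        ranked = sorted(
            range(len(chunks)),
            key=lambda i: overlap_score(query, getattr(chunks[i], view)),
            reverse=True,
        )
        for i in ranked[:k]:
            if i not in selected:
                selected.append(i)
    return [chunks[i] for i in selected]


DOC = """# Installation
Install the package with pip. The package needs Python 3.9.
# Indexing
Chunks follow section boundaries. Each chunk keeps raw text, keywords, and a summary.
# Evaluation
Recall measures whether the retrieved chunks cover the answer scope.
"""

index = build_index(DOC)
top = retrieve("which sections define chunk boundaries", index, k=1)
print([chunk.title for chunk in top])
```

Because the views are stored as plain strings, `overlap_score` could in principle be swapped for any retriever's scoring function without touching the chunking step, which is one way to read the plug-and-play claim. How the per-view results are merged here is a guess for illustration and may differ from the paper.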

URL

https://arxiv.org/abs/2404.15103

PDF

https://arxiv.org/pdf/2404.15103.pdf

