Paper Reading AI Learner

Unified Multi-Modal Interleaved Document Representation for Information Retrieval

2024-10-03 17:49:09
Jaewoo Lee, Joonho Ko, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

Abstract

Information Retrieval (IR) methods aim to identify documents relevant to a given query and have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within documents, overlooking the fact that documents can contain multiple modalities, including text, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and the interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models to process and integrate text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we merge the representations of the segmented passages into a single document representation, and we additionally introduce a reranking strategy to decouple and identify the relevant passage within a document when necessary. Through extensive experiments on diverse information retrieval scenarios covering both textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to its unified consideration of the multimodal information interleaved within documents.
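The retrieve-then-rerank idea described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `embed_passages` is a hypothetical stand-in for the vision-language encoder, and mean pooling is an assumed merge operator (the abstract only states that passage representations are merged into one document representation).

```python
import numpy as np

def embed_passages(passages):
    # Hypothetical stand-in for a vision-language encoder that maps each
    # interleaved passage (text / image / table) to a fixed-size vector.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(passages), 8))

def document_embedding(passage_embs):
    # Merge passage embeddings into one document vector. Mean pooling is an
    # assumption for illustration; the paper only says representations are merged.
    return passage_embs.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_then_rerank(query_emb, docs, top_k=2):
    # Stage 1: score whole documents via their single merged representation.
    scored = sorted(
        ((cosine(query_emb, document_embedding(p)), doc_id) for doc_id, p in docs.items()),
        reverse=True,
    )
    # Stage 2: within the top documents, rerank individual passages to
    # decouple and surface the most relevant one.
    results = []
    for _, doc_id in scored[:top_k]:
        passage_scores = [cosine(query_emb, p) for p in docs[doc_id]]
        results.append((doc_id, int(np.argmax(passage_scores))))
    return results
```

The point of the two stages is that retrieval operates over one holistic vector per document (avoiding the lost cross-passage context of passage-level retrieval), while the cheaper passage-level reranking runs only on the few documents that survive stage 1.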


URL

https://arxiv.org/abs/2410.02729

PDF

https://arxiv.org/pdf/2410.02729.pdf

