
iRAG: An Incremental Retrieval Augmented Generation System for Videos

2024-04-18 16:38:02
Md Adnan Arefeen, Biplob Debnath, Md Yusuf Sarwar Uddin, Srimat Chakradhar

Abstract

Retrieval augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications such as chatbots. Using RAG for combined understanding of multimodal data such as text, images and videos is appealing, but two critical limitations exist: one-time, upfront capture of all content in large multimodal data as text descriptions entails high processing times, and not all information in the rich multimodal data is typically captured in the text descriptions. Since user queries are not known a priori, developing a system for multimodal-to-text conversion and interactive querying of multimodal data is challenging. To address these limitations, we propose iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of a large corpus of multimodal data. Unlike traditional RAG, iRAG quickly indexes large repositories of multimodal data, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the multimodal data to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long multimodal-to-text conversion times, overcomes information loss by doing on-demand, query-specific extraction of details in multimodal data, and ensures high-quality responses to interactive user queries that are often not known a priori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of large, real-world multimodal data. Experimental results on real-world long videos demonstrate 23x to 25x faster video-to-text ingestion, while ensuring that the quality of responses to interactive user queries is comparable to responses from a traditional RAG where all video data is converted to text upfront before any querying.
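
The incremental workflow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the models, index structure, or selection policy, so everything below is an assumption made for illustration. The names `Segment`, `IncrementalIndex`, `coarse_text`, and `hidden_details` are hypothetical, a bag-of-words cosine score stands in for whatever retriever iRAG uses, and the "expensive" extractor is a placeholder for a heavy video-to-text model. The sketch only shows the general idea: index cheaply up front, then extract details on demand for the few segments a query selects, and cache the result.

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class Segment:
    """One chunk of a video. Every field is a hypothetical stand-in."""
    seg_id: int
    coarse_text: str         # cheap description produced at ingestion time
    hidden_details: str      # stands in for content only a costly model would extract
    detailed_text: str = ""  # filled lazily, only when a query selects this segment


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalIndex:
    """Toy index: coarse text for all segments, details only on demand."""

    def __init__(self, segments):
        self.segments = list(segments)
        self.expensive_calls = 0  # counts simulated heavy-model invocations

    def _extract_details(self, seg: Segment) -> str:
        # Placeholder for an expensive, query-time extraction step.
        # The result is cached on the segment so it runs at most once.
        if not seg.detailed_text:
            self.expensive_calls += 1
            seg.detailed_text = seg.hidden_details
        return seg.detailed_text

    def retrieve(self, query: str, k: int = 1):
        """Rank segments by coarse text, then enrich only the top k."""
        q = bag_of_words(query)
        ranked = sorted(
            self.segments,
            key=lambda s: cosine(q, bag_of_words(s.coarse_text)),
            reverse=True,
        )
        return [f"{s.coarse_text}. {self._extract_details(s)}" for s in ranked[:k]]
```

In this sketch the returned strings would be passed to a language model as context. The point of the design is that the costly step runs only for segments a query actually touches, so segments that are never queried never pay the conversion cost.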

URL

https://arxiv.org/abs/2404.12309

PDF

https://arxiv.org/pdf/2404.12309.pdf

