MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

2024-04-08 17:59:24
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

Abstract

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has recently gained increasing interest. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames and are therefore restricted to short-video understanding. In this study, we focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously, as most existing work does, we propose processing videos in an online manner and storing past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding the LLM's context-length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on a variety of video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets. Code is available at this https URL.
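The abstract's central mechanism is an online, fixed-budget memory bank: frame features are appended as the video streams in, and the bank is compressed whenever it grows past its budget, so memory cost stays constant regardless of video length. The sketch below illustrates one way such a bank could work; the `MemoryBank` class, its `capacity` parameter, and the rule of averaging the most similar adjacent pair of entries are illustrative assumptions, not the paper's exact implementation (see the paper and the released code for the actual design).

```python
import torch
import torch.nn.functional as F


class MemoryBank:
    """Fixed-budget store of per-frame features for online video processing.

    Illustrative sketch only: when the bank exceeds `capacity`, the two
    most similar temporally adjacent entries are averaged into one, so at
    most `capacity` entries are kept no matter how long the video is.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.features: list[torch.Tensor] = []  # each entry: (num_tokens, dim)

    def add(self, frame_feat: torch.Tensor) -> None:
        """Append one frame's features, compressing if over budget."""
        self.features.append(frame_feat)
        if len(self.features) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        """Merge the most redundant adjacent pair of stored features."""
        flat = torch.stack([f.flatten() for f in self.features])
        flat = F.normalize(flat, dim=-1)
        # Cosine similarity between each entry and its temporal successor.
        sims = (flat[:-1] * flat[1:]).sum(dim=-1)
        i = int(sims.argmax())
        merged = (self.features[i] + self.features[i + 1]) / 2
        self.features[i : i + 2] = [merged]

    def read(self) -> torch.Tensor:
        """Stack stored features for downstream attention over the past."""
        return torch.stack(self.features)  # (<=capacity, num_tokens, dim)


if __name__ == "__main__":
    bank = MemoryBank(capacity=20)
    for _ in range(1000):                  # e.g., a 1000-frame video
        frame_feat = torch.randn(32, 768)  # stand-in for visual encoder output
        bank.add(frame_feat)
    print(bank.read().shape)               # torch.Size([20, 32, 768])
```

In a full model, each frame's features would come from a frozen visual encoder, and the bank's contents would be attended to by the querying module feeding the LLM; the point of the design is that the memory footprint is bounded by `capacity` rather than by video length.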

URL

https://arxiv.org/abs/2404.05726

PDF

https://arxiv.org/pdf/2404.05726.pdf

