Abstract
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model processes both temporal visual and textual data, making it adept at understanding the complexities of videos. Building on MiniGPT-v2, which excelled at translating visual features into the LLM space for single images and achieved strong results on various image-text benchmarks, this paper extends the model to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-Video considers not only visual content but also textual conversations, allowing the model to effectively answer queries involving both visual and text components. The proposed model outperforms existing state-of-the-art methods, with gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks, respectively. Our models and code are publicly available.
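The abstract describes projecting per-frame visual features into the LLM's embedding space and combining them with textual input such as subtitles. The sketch below illustrates one plausible way this could work; it is not the authors' implementation, and all names, dimensions, and tensor shapes (VIS_DIM, LLM_DIM, VisionProjector, tokens per frame) are illustrative assumptions.

```python
# Hedged sketch (not the paper's code): projecting frame features into an
# LLM's token-embedding space and interleaving them with subtitle-text
# embeddings, along the lines the abstract describes.
import torch
import torch.nn as nn

VIS_DIM, LLM_DIM, NUM_FRAMES = 1024, 4096, 8  # assumed sizes


class VisionProjector(nn.Module):
    """Linear map from visual-encoder features to the LLM embedding space."""

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, tokens_per_frame, vis_dim)
        return self.proj(frame_feats)  # (num_frames, tokens_per_frame, llm_dim)


def build_input_sequence(frame_embeds: torch.Tensor,
                         text_embeds: torch.Tensor) -> torch.Tensor:
    """Interleave projected frame embeddings with per-frame text embeddings,
    e.g. <frame_1> <subtitle_1> <frame_2> <subtitle_2> ..., producing one
    soft-token sequence for the LLM."""
    chunks = []
    for frame, text in zip(frame_embeds, text_embeds):
        chunks.append(frame)
        chunks.append(text)
    return torch.cat(chunks, dim=0)  # (total_tokens, llm_dim)


# Toy usage with random tensors standing in for real encoder outputs.
projector = VisionProjector(VIS_DIM, LLM_DIM)
frames = torch.randn(NUM_FRAMES, 4, VIS_DIM)     # 4 visual tokens per frame
subtitles = torch.randn(NUM_FRAMES, 6, LLM_DIM)  # 6 text tokens per frame
seq = build_input_sequence(projector(frames), subtitles)
print(seq.shape)  # torch.Size([80, 4096]), fed to the LLM as soft tokens
```

The design choice sketched here (a learned linear projection plus interleaving) mirrors the abstract's claim that the model handles both temporal visual content and accompanying text in a single query.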
URL
https://arxiv.org/abs/2404.03413