
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

2023-11-29 08:27:00
Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Chung-Ching Lin, Zicheng Liu, Lijuan Wang

Abstract

We present MM-Narrator, a novel system that leverages GPT-4 with multimodal in-context learning to generate audio descriptions (AD). Unlike previous methods that primarily focus on downstream fine-tuning with short video clips, MM-Narrator excels at generating precise audio descriptions for videos of extensive length, even beyond hours, in an autoregressive manner. This capability is enabled by the proposed memory-augmented generation process, which effectively utilizes both short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, ensuring accurate tracking and depiction of story-coherent, character-centric audio descriptions. While maintaining MM-Narrator's training-free design, we further propose a complexity-based demonstration selection strategy to largely enhance its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both existing fine-tuning-based approaches and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons about and scores AD generation performance along various extendable dimensions.
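To make the memory-augmented, register-and-recall generation loop described above more concrete, the following is a minimal Python sketch of the control flow only: an autoregressive loop that keeps a short-term textual context of recent ADs and a long-term visual memory that is queried ("recall") before generating and updated ("register") after. All names here (VisualMemory, generate_ad, the embed/describe/llm callables) are hypothetical illustrations and not the authors' released code or API; the actual system builds these memories from visual features and character identities.

```python
# Minimal sketch of a register-and-recall, memory-augmented AD generation loop.
# Hypothetical helper names; the embed/describe/llm callables are supplied by the caller.
from collections import deque


class VisualMemory:
    """Long-term store of (embedding, text) entries with naive similarity-based recall."""

    def __init__(self):
        self.entries = []  # list of (embedding, text) pairs

    def register(self, embedding, text):
        self.entries.append((embedding, text))

    def recall(self, query, top_k=3):
        # Dot-product similarity on toy list embeddings (illustrative only).
        def score(entry):
            return sum(a * b for a, b in zip(entry[0], query))
        ranked = sorted(self.entries, key=score, reverse=True)
        return [text for _, text in ranked[:top_k]]


def generate_ad(clips, embed, describe, llm, short_term_size=5):
    """Autoregressively narrate clips using short-term text plus long-term visual memory."""
    short_term = deque(maxlen=short_term_size)  # recent ADs (short-term textual context)
    long_term = VisualMemory()                  # register-and-recall long-term memory
    descriptions = []
    for clip in clips:
        emb = embed(clip)
        recalled = long_term.recall(emb)        # recall relevant past context
        prompt = (
            "Recent audio descriptions:\n" + "\n".join(short_term) +
            "\nRelevant past context:\n" + "\n".join(recalled) +
            "\nCurrent clip: " + describe(clip) +
            "\nWrite the next audio description:"
        )
        ad = llm(prompt)
        descriptions.append(ad)
        short_term.append(ad)                   # update short-term context
        long_term.register(emb, ad)             # register into long-term memory
    return descriptions
```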

Abstract (translated)

We present MM-Narrator, a novel system that leverages GPT-4 with multimodal in-context learning to generate audio descriptions (AD). Unlike previous methods that mainly focus on downstream fine-tuning with short video clips, MM-Narrator excels at generating accurate audio descriptions for videos of considerable length, even beyond an hour. This is achieved through the proposed memory-augmented generation process, which effectively utilizes short-term textual context and long-term visual memory via an efficient register-and-recall mechanism. These contextual memories compile relevant past information, including storylines and character identities, ensuring accurate, story-coherent, and character-centric audio descriptions. To preserve MM-Narrator's training-free design, we further propose a complexity-based demonstration selection strategy that significantly enhances its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset show that MM-Narrator surpasses existing fine-tuning-based and LLM-based approaches in most scenarios. In addition, we introduce the first segment-based evaluator for recurrent text generation. Powered by GPT-4, this evaluator comprehensively reasons about and scores AD generation performance along various extendable dimensions.

URL

https://arxiv.org/abs/2311.17435

PDF

https://arxiv.org/pdf/2311.17435.pdf

