In this paper, we explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned by a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, and thereby naturally facilitates video understanding. We validate this hypothesis on the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed ``VD-IT'', with dedicated components built upon a fixed pretrained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. Besides, instead of using standard Gaussian noise, we propose to predict video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and elevates segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pretrained with discriminative image/video pretext tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code will be available at \url{this https URL}
https://arxiv.org/abs/2403.12042
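A minimal sketch of the video-specific noise idea described above, assuming a toy `NoisePredictor` head and a simple residual blend with Gaussian noise (the module design, latent shape, and blending weight are illustrative, not the paper's implementation):

```python
# Sketch: predict content-aware noise from video latents instead of sampling
# pure Gaussian noise. Module name and blending scheme are assumptions.
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Tiny 3D conv head that maps video latents to a noise estimate."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, channels, kernel_size=3, padding=1),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (B, C, T, H, W) video latents from the frozen T2V encoder
        return self.net(latents)

def make_noise(latents: torch.Tensor, predictor: NoisePredictor, alpha: float = 0.5):
    """Blend predicted, video-specific noise with standard Gaussian noise."""
    predicted = predictor(latents)
    gaussian = torch.randn_like(latents)
    return alpha * predicted + (1.0 - alpha) * gaussian

latents = torch.randn(1, 4, 8, 32, 32)        # (batch, channels, frames, H, W)
noise = make_noise(latents, NoisePredictor())
print(noise.shape)                            # torch.Size([1, 4, 8, 32, 32])
```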
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
https://arxiv.org/abs/2403.11481
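A rough sketch of the unified memory described above: event captions indexed by time segment plus object-centric tracking states, each exposed as a queryable tool. Class and method names (`VideoMemory`, `localize`, `query_object`) are illustrative assumptions, not the authors' API:

```python
# Sketch of a structured video memory: event captions per segment plus
# object-centric tracking states. Names and fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    object_id: int
    category: str
    # frame index -> bounding box (x1, y1, x2, y2)
    track: dict = field(default_factory=dict)

@dataclass
class VideoMemory:
    # (start_s, end_s) -> natural-language event description
    events: dict = field(default_factory=dict)
    objects: dict = field(default_factory=dict)

    def add_event(self, start: float, end: float, caption: str):
        self.events[(start, end)] = caption

    def localize(self, keyword: str):
        """Tool: return segments whose caption mentions a keyword."""
        return [seg for seg, cap in self.events.items() if keyword.lower() in cap.lower()]

    def query_object(self, category: str):
        """Tool: return tracked objects of a given category."""
        return [o for o in self.objects.values() if o.category == category]

mem = VideoMemory()
mem.add_event(0.0, 12.5, "A person opens the fridge and takes out milk.")
mem.objects[1] = ObjectState(1, "person", {0: (10, 20, 110, 220)})
print(mem.localize("fridge"), mem.query_object("person"))
```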
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
https://arxiv.org/abs/2403.10517
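A hedged sketch of the iterative gather-and-answer loop, with `llm`, `caption_frames`, and `retrieve_frames` as stand-ins for the actual foundation-model calls; the prompt format and stopping rule are assumptions:

```python
# Sketch of the iterative loop: an LLM agent repeatedly decides whether the
# frames gathered so far suffice to answer the question, otherwise asks for
# more targeted frames.
def llm(prompt: str) -> str:
    return "ANSWER: a person cooking"          # placeholder response

def caption_frames(frames):
    return [f"caption for frame {f}" for f in frames]

def retrieve_frames(query: str, k: int):
    return list(range(k))                      # placeholder frame indices

def answer_question(question: str, max_rounds: int = 3, k: int = 4) -> str:
    frames = retrieve_frames(question, k)      # cheap initial sampling
    for _ in range(max_rounds):
        context = "\n".join(caption_frames(frames))
        reply = llm(f"Question: {question}\nEvidence:\n{context}\n"
                    "Reply with ANSWER: ... or MORE: <what to look for>")
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        frames += retrieve_frames(reply, k)    # fetch more targeted frames
    return "unable to answer"

print(answer_question("What is the person doing?"))
```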
Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state space model, e.g., Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work, we conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work could provide valuable data points and insights for future research on video understanding. Code is public: this https URL.
https://arxiv.org/abs/2403.09626
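For intuition on why state-space models scale to long videos, here is a toy diagonal SSM recurrence showing the linear-time scan (this omits Mamba's input-dependent selectivity and hardware-aware implementation; shapes and parameters are illustrative):

```python
# Sketch of the linear recurrence behind state-space models such as Mamba:
# one O(D) update per timestep, so the cost grows linearly with length T.
import torch

def ssm_scan(x, A, B, C):
    """x: (T, D) inputs; A, B, C: (D,) diagonal state-space parameters."""
    T, D = x.shape
    h = torch.zeros(D)
    ys = []
    for t in range(T):
        h = A * h + B * x[t]           # state update
        ys.append(C * h)               # readout
    return torch.stack(ys)

T, D = 16, 8
x = torch.randn(T, D)
y = ssm_scan(x, torch.full((D,), 0.9), torch.ones(D), torch.ones(D))
print(y.shape)                         # torch.Size([16, 8])
```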
We introduce a novel text-to-pose video editing method, ReimaginedAct. While existing video editing tasks are limited to changes in attributes, backgrounds, and styles, our method aims to predict open-ended human action changes in video. Moreover, our method can accept not only direct instructional text prompts but also `what if' questions to predict possible action changes. ReimaginedAct comprises video understanding, reasoning, and editing modules. First, an LLM is used to obtain a plausible answer for the instruction or question, which is then used for (1) prompting Grounded-SAM to produce bounding boxes of relevant individuals and (2) retrieving a set of pose videos that we have collected for editing human actions. The retrieved pose videos and the detected individuals are then utilized to alter the poses extracted from the original video. We also employ a timestep blending module to ensure the edited video retains its original content except where modifications are needed. To facilitate research in text-to-pose video editing, we introduce a new evaluation dataset, WhatifVideo-1.0. This dataset includes videos of different scenarios spanning a range of difficulty levels, along with questions and text prompts. Experimental results demonstrate that existing video editing methods struggle with human action editing, while our approach can achieve effective action editing and even imaginary editing from counterfactual questions.
https://arxiv.org/abs/2403.07198
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for high-resolution long video understanding. Extensive evaluations reveal VideoMamba's four core abilities: (1) Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) Sensitivity for recognizing short-term actions even with fine-grained motion differences; (3) Superiority in long-term video understanding, showcasing significant advancements over traditional feature-based models; and (4) Compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark for video understanding, offering a scalable and efficient solution for comprehensive video understanding. All the code and models are available at this https URL.
https://arxiv.org/abs/2403.06977
In this study, we identify the inefficient attention phenomenon in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat, and Video-LLaVA. We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient. It can compress the FLOPs of a 13B-parameter model to a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical value for the deployment of LVLMs on edge devices and in commercial models. Code is released at this https URL.
https://arxiv.org/abs/2403.06764
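A small sketch of the attention-guided visual-token pruning idea in the spirit of FastV: score visual tokens by the attention they receive and keep only the top fraction in deeper layers. The tensor layout, keep ratio, and scoring rule are assumptions, not FastV's exact procedure:

```python
# Sketch: rank visual tokens by received attention, keep the top-k, and
# concatenate the surviving tokens with the untouched text tokens.
import torch

def prune_visual_tokens(hidden, attn, visual_slice, keep_ratio=0.5):
    """
    hidden: (B, N, D) hidden states; attn: (B, heads, N, N) attention weights;
    visual_slice: (start, end) positions of visual tokens in the sequence.
    """
    start, end = visual_slice
    # mean attention each visual token receives, over heads and query positions
    scores = attn[:, :, :, start:end].mean(dim=(1, 2))         # (B, n_visual)
    k = max(1, int(scores.shape[1] * keep_ratio))
    keep = scores.topk(k, dim=1).indices + start                # (B, k)
    kept_visual = torch.gather(
        hidden, 1, keep.unsqueeze(-1).expand(-1, -1, hidden.shape[-1]))
    return torch.cat([hidden[:, :start], kept_visual, hidden[:, end:]], dim=1)

B, N, D, H = 1, 32, 16, 4
hidden = torch.randn(B, N, D)
attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
print(prune_visual_tokens(hidden, attn, (4, 28)).shape)         # fewer tokens
```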
Text-to-video generation marks a significant frontier in the rapidly evolving domain of generative AI, integrating advancements in text-to-image synthesis, video captioning, and text-guided editing. This survey critically examines the progression of text-to-video technologies, focusing on the shift from traditional generative models to the cutting-edge Sora model, highlighting developments in scalability and generalizability. Distinguishing our analysis from prior works, we offer an in-depth exploration of the technological frameworks and evolutionary pathways of these models. Additionally, we delve into practical applications and address ethical and technological challenges such as the inability to handle multiple entities, comprehend cause-and-effect relations, understand physical interaction, perceive object scaling and proportioning, and combat object hallucination, which is also a long-standing problem in generative models. Our comprehensive discussion covers the enablement of text-to-video generation models as human-assistive tools and world models, as well as eliciting the models' shortcomings and summarizing future improvement directions, which mainly center around training datasets and evaluation metrics (both automatic and human-centered). Aimed at both newcomers and seasoned researchers, this survey seeks to catalyze further innovation and discussion in the growing field of text-to-video generation, paving the way for more reliable and practical generative artificial intelligence technologies.
https://arxiv.org/abs/2403.05131
Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., "where") in videos. Yet, knowing merely "where" is insufficient in many crucial applications. In comparison, semantic understanding such as fine-grained behaviors, interactions, and overall summarized captions (i.e., "what") from videos, associated with "where", is highly desired for comprehensive video analysis. Thus motivated, we introduce Semantic Multi-Object Tracking (SMOT), which aims to estimate object trajectories and meanwhile understand the semantic details of the associated trajectories, including instance captions, instance interactions, and the overall video caption, integrating "where" and "what" for tracking. In order to foster the exploration of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT. Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various scenarios for semantic tracking of humans. BenSMOT provides annotations for the trajectories of targets, along with associated instance captions in natural language, instance interactions, and an overall caption for each video sequence. To the best of our knowledge, BenSMOT is the first publicly available benchmark for SMOT. Besides, to encourage future research, we present a novel tracker named SMOTer, which is specially designed and end-to-end trained for SMOT, showing promising performance. By releasing BenSMOT, we expect to go beyond conventional MOT by predicting both "where" and "what" for SMOT, opening up a new direction in tracking for video understanding. Our BenSMOT and SMOTer will be released.
https://arxiv.org/abs/2403.05021
Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, everything all at once. We believe that, to effectively transfer such a holistic perception to intelligent machines, an important role is played by learning to correlate concepts and to abstract knowledge coming from different tasks, so that they can be exploited synergistically when learning novel skills. To accomplish this, we seek a unified approach to video understanding which combines shared temporal modelling of human actions with minimal overhead, to support multiple downstream tasks and enable cooperation when learning novel skills. We then propose EgoPack, a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights, as a backpack of skills that a robot can carry around and use when needed. We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks, outperforming current state-of-the-art methods.
https://arxiv.org/abs/2403.03037
The development of multimodal models has marked a significant step forward in how machines understand videos. These models have shown promise in analyzing short video clips. However, when it comes to longer formats like movies, they often fall short. The main hurdles are the lack of high-quality, diverse video data and the intensive work required to collect or annotate such data. In the face of these challenges, we propose MovieLLM, a novel framework designed to create synthetic, high-quality data for long videos. This framework leverages the power of GPT-4 and text-to-image models to generate detailed scripts and corresponding visuals. Our approach stands out for its flexibility and scalability, making it a superior alternative to traditional data collection methods. Our extensive experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives, overcoming the limitations of existing datasets regarding scarcity and bias.
https://arxiv.org/abs/2403.01422
We present MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild ego-view accident videos, each with temporally aligned text descriptions. We annotate over 2.23 million object boxes and 58,650 pairs of video-based accident reasons, covering 58 accident categories. MM-AU supports various accident understanding tasks, particularly multimodal video diffusion to understand accident cause-effect chains for safe driving. With MM-AU, we present an Abductive accident Video understanding framework for Safe Driving perception (AdVersa-SD). AdVersa-SD performs video diffusion via an Object-Centric Video Diffusion (OAVD) method which is driven by an abductive CLIP model. This model involves a contrastive interaction loss to learn the pair co-occurrence of normal, near-accident, accident frames with the corresponding text descriptions, such as accident reasons, prevention advice, and accident categories. OAVD enforces the causal region learning while fixing the content of the original frame background in video generation, to find the dominant cause-effect chain for certain accidents. Extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD against the state-of-the-art diffusion models. Additionally, we provide careful benchmark evaluations for object detection and accident reason answering since AdVersa-SD relies on precise object and accident reason information.
https://arxiv.org/abs/2403.00436
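A hedged sketch of a frame-text contrastive loss of the kind the abstract describes for aligning normal / near-accident / accident frames with their descriptions; this is a generic symmetric InfoNCE stand-in, not the exact AdVersa-SD interaction loss:

```python
# Sketch of a symmetric frame-text contrastive loss (InfoNCE-style).
import torch
import torch.nn.functional as F

def contrastive_loss(frame_emb, text_emb, temperature=0.07):
    """frame_emb, text_emb: (N, D) paired embeddings (i-th frame <-> i-th text)."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = frame_emb @ text_emb.t() / temperature     # (N, N) similarities
    targets = torch.arange(frame_emb.shape[0])
    # symmetric cross-entropy: match frames to texts and texts to frames
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

frames = torch.randn(8, 256)
texts = torch.randn(8, 256)
print(contrastive_loss(frames, texts).item())
```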
The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video descriptions, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected, and then employ the model on the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.
https://arxiv.org/abs/2402.19479
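A toy sketch of the final caption-selection step: multiple teacher captions are scored against the clip by a retrieval model and the best one is kept. `embed_video` and `embed_text` are random stand-ins for the finetuned retrieval encoders:

```python
# Sketch: pick the caption whose embedding is most similar to the clip embedding.
import numpy as np

def embed_video(clip_id: str) -> np.ndarray:
    rng = np.random.default_rng(hash(clip_id) % 2**32)
    return rng.standard_normal(128)

def embed_text(caption: str) -> np.ndarray:
    rng = np.random.default_rng(hash(caption) % 2**32)
    return rng.standard_normal(128)

def select_best_caption(clip_id: str, candidate_captions: list) -> str:
    v = embed_video(clip_id)
    v = v / np.linalg.norm(v)
    scores = []
    for cap in candidate_captions:
        t = embed_text(cap)
        scores.append(float(v @ (t / np.linalg.norm(t))))   # cosine similarity
    return candidate_captions[int(np.argmax(scores))]

captions = ["a dog runs on the beach", "people cooking", "a car driving at night"]
print(select_best_caption("clip_0001", captions))
```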
It is challenging to perform question-answering over complex, multimodal content such as television clips. This is in part because current video-language models rely on single-modality reasoning, have lowered performance on long inputs, and lack interpretability. We propose TV-TREES, the first multimodal entailment tree generator. TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by producing trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions. We then introduce the task of multimodal entailment tree generation to evaluate the reasoning quality of such methods. Our method's experimental results on the challenging TVQA dataset demonstrate interpretable, state-of-the-art zero-shot performance on full video clips, illustrating a best-of-both-worlds contrast to black-box methods.
https://arxiv.org/abs/2402.19467
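A minimal sketch of the entailment-tree data structure the abstract describes, with leaves grounded in visual or dialogue evidence and internal nodes as entailed conclusions; field names are illustrative:

```python
# Sketch of an entailment tree: leaves are simple premises grounded in the
# video or transcript, internal nodes are conclusions entailed by children.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EntailmentNode:
    statement: str
    modality: str = "conclusion"            # "visual", "dialogue", or "conclusion"
    children: List["EntailmentNode"] = field(default_factory=list)

    def print_tree(self, depth: int = 0):
        print("  " * depth + f"[{self.modality}] {self.statement}")
        for child in self.children:
            child.print_tree(depth + 1)

root = EntailmentNode("The characters are planning a surprise party.")
root.children = [
    EntailmentNode("A character says 'don't tell her about Saturday'.", "dialogue"),
    EntailmentNode("Balloons and a cake are visible on the table.", "visual"),
]
root.print_tree()
```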
To address catastrophic forgetting caused by the unavailability of old categories in sequential input, existing work on relatively simple classification tasks has made some progress. In contrast, video captioning is a more complex multimodal task that has not been explored in the field of incremental learning. After identifying this stability-plasticity problem when analyzing video with sequential input, we propose a method to Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning (MCF-VC). To effectively maintain good performance on old tasks at the macro level, we design Fine-grained Sensitivity Selection (FgSS), which uses a mask over the linear layers' parameters together with Fisher sensitivity to pick useful knowledge from old tasks. Further, to better constrain the knowledge characteristics of old and new tasks at the feature level, we create Two-stage Knowledge Distillation (TsKD), which learns the new task well while weighing the old one. Specifically, we design two distillation losses that constrain the cross-modal semantic information of the semantic attention feature maps and the textual information of the final outputs, respectively, so that the inter-model and intra-model stylized knowledge of the old classes is retained while learning the new ones. To quantify our model's resistance to forgetting, we design a metric, CIDER_t, that measures the stage-wise forgetting rate. Our experiments on the public dataset MSR-VTT show that the proposed method significantly resists the forgetting of previous tasks without replaying old samples, and performs well on the new task.
https://arxiv.org/abs/2402.17680
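A rough sketch of Fisher-sensitivity-based parameter selection as a way to protect old-task knowledge: accumulate squared gradients on old-task data and mask the most sensitive parameters. The thresholding and loop are assumptions, not the paper's exact FgSS procedure:

```python
# Sketch: estimate per-parameter importance for the old task from squared
# gradients, then mark the most sensitive fraction to be preserved.
import torch
import torch.nn as nn

def fisher_masks(model, old_task_batches, loss_fn, keep_fraction=0.3):
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for inputs, targets in old_task_batches:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    masks = {}
    for n, f in fisher.items():
        k = max(1, int(f.numel() * keep_fraction))
        threshold = f.flatten().topk(k).values.min()
        masks[n] = (f >= threshold)          # True = important for the old task
    return masks

model = nn.Linear(16, 4)
batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(3)]
masks = fisher_masks(model, batches, nn.CrossEntropyLoss())
print({n: int(m.sum()) for n, m in masks.items()})
```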
Long video understanding is a significant and ongoing challenge in the intersection of multimedia and artificial intelligence. Employing large language models (LLMs) for comprehending video has become an emerging and promising method. However, this approach incurs high computational costs due to the extensive array of video tokens, experiences reduced visual clarity as a consequence of token aggregation, and confronts challenges arising from irrelevant visual tokens while answering video-related questions. To alleviate these issues, we present an Interactive Visual Adapter (IVA) within LLMs, designed to enhance interaction with fine-grained visual elements. Specifically, we first transform long videos into temporal video tokens by leveraging a visual encoder alongside a pretrained causal transformer, then feed them into LLMs with the video instructions. Subsequently, we integrate IVA, which contains a lightweight temporal frame selector and a spatial feature interactor, within the internal blocks of LLMs to capture instruction-aware and fine-grained visual signals. Consequently, the proposed video-LLM facilitates a comprehensive understanding of long video content through appropriate long video modeling and precise visual interactions. We conducted extensive experiments on nine video understanding benchmarks, and the results show that our interactive visual adapter significantly improves the performance of video LLMs on long video QA tasks. Ablation studies further verify the effectiveness of IVA in long and short video understanding.
https://arxiv.org/abs/2402.13546
Most video captioning models are designed to process short video clips of few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos. Furthermore, we introduce Ego4D-HCap dataset by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are available at: this https URL
https://arxiv.org/abs/2402.13250
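A toy sketch of the recursive, hierarchical captioning scheme: clip-level captions are summarized into segment-level captions, which are summarized into a video-level summary. `caption_clip` and `summarize` are placeholders for the actual video-language model:

```python
# Sketch: three-level recursive captioning (clips -> segments -> whole video).
def caption_clip(clip_id: int) -> str:
    return f"atomic action in clip {clip_id}"

def summarize(captions: list) -> str:
    return "summary of: " + "; ".join(captions)

def hierarchical_captions(num_clips: int, group_size: int = 4):
    clip_caps = [caption_clip(i) for i in range(num_clips)]            # level 1
    segment_caps = [summarize(clip_caps[i:i + group_size])
                    for i in range(0, num_clips, group_size)]          # level 2
    video_summary = summarize(segment_caps)                            # level 3
    return clip_caps, segment_caps, video_summary

clips, segments, video = hierarchical_captions(8)
print(len(clips), len(segments))
print(video[:80])
```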
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 30 out of 33 video understanding benchmarks.
https://arxiv.org/abs/2402.13217
Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the development of an efficient method to encapsulate video content into a set of representative tokens to align with LLMs. In this work, we introduce Slot-VLM, a novel framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. Particularly, we design a SlowFast Slots module, i.e., SF-Slots, that adaptively aggregates the dense video tokens from the CLIP vision encoder to a set of representative slots. In order to take into account both the spatial object details and the varied temporal dynamics, SF-Slots is built with a dual-branch structure. The Slow-Slots branch focuses on extracting object-centric slots from features at high spatial resolution but low (slow) frame sample rate, emphasizing detailed object information. Conversely, Fast-Slots branch is engineered to learn event-centric slots from high temporal sample rate but low spatial resolution features. These complementary slots are combined to form the vision context, serving as the input to the LLM for efficient question answering. Our experimental results demonstrate the effectiveness of our Slot-VLM, which achieves the state-of-the-art performance on video question-answering.
https://arxiv.org/abs/2402.13088
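A simplified sketch of pooling dense vision tokens into a few slots with cross-attention, in two branches (spatially dense / temporally sparse and vice versa); the single-step attention, slot counts, and dimensions are illustrative simplifications of SF-Slots:

```python
# Sketch: learned slot queries cross-attend to dense vision tokens, producing
# a compact set of slots per branch; the two branches are concatenated as the
# vision context fed to the LLM.
import torch
import torch.nn as nn

class SlotPool(nn.Module):
    def __init__(self, num_slots: int, dim: int):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) dense vision tokens -> (B, num_slots, D)
        queries = self.slots.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(queries, tokens, tokens)
        return pooled

dim = 64
slow_tokens = torch.randn(1, 4 * 196, dim)     # few frames, many patches
fast_tokens = torch.randn(1, 32 * 16, dim)     # many frames, few patches
slow_slots = SlotPool(num_slots=8, dim=dim)(slow_tokens)
fast_slots = SlotPool(num_slots=8, dim=dim)(fast_tokens)
vision_context = torch.cat([slow_slots, fast_slots], dim=1)
print(vision_context.shape)                    # torch.Size([1, 16, 64])
```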
A vast literature has compared recordings of biological neurons in the brain to deep neural networks. The ultimate goal is to interpret deep networks or to better understand and encode biological neural systems. Recently, there has been a debate on whether system identification is possible and how much it can tell us about brain computation. System identification asks whether one model is more valid than another as a representation of the brain's computation. Nonetheless, previous work did not consider the time aspect, nor how video and dynamics (e.g., motion) modelling in deep networks relates to these biological neural systems within a large-scale comparison. Towards this end, we propose a system identification study focused on comparing single-image vs. video understanding models with respect to visual cortex recordings. Our study encompasses two sets of experiments: a real environment setup and a simulated environment setup. The study also encompasses more than 30 models and, unlike prior works, we focus on convolutional vs. transformer-based, single- vs. two-stream, and fully supervised vs. self-supervised video understanding models. The goal is to capture a greater variety of architectures that model dynamics. As such, this constitutes the first large-scale study of video understanding models from a neuroscience perspective. Our results in the simulated experiments show that system identification can be attained to a certain level in differentiating image vs. video understanding models. Moreover, we provide key insights into how video understanding models predict visual cortex responses: video understanding models outperform image understanding models; convolutional models predict the early-to-mid regions better than transformer-based ones, except for multiscale transformers, which remain good at predicting these regions; and two-stream models outperform single-stream ones.
https://arxiv.org/abs/2402.12519
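A minimal sketch of the standard encoding-model comparison implied above: ridge-regress recorded responses onto a candidate model's features and compare held-out correlations across models. Data here are random stand-ins:

```python
# Sketch: fit a ridge encoding model from video-model features to neural
# responses and score it by held-out Pearson correlation per recorded unit.
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    d = X_train.shape[1]
    w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d), X_train.T @ y_train)
    return X_test @ w

def encoding_score(features, responses, n_train):
    pred = ridge_fit_predict(features[:n_train], responses[:n_train],
                             features[n_train:])
    true = responses[n_train:]
    corrs = [np.corrcoef(pred[:, i], true[:, i])[0, 1] for i in range(true.shape[1])]
    return float(np.mean(corrs))

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 64))      # e.g. model features per stimulus
responses = features @ rng.standard_normal((64, 10)) + 0.1 * rng.standard_normal((200, 10))
print(encoding_score(features, responses, n_train=150))
```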