We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
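The Token Mark mechanism described above can be sketched in a few lines: the same learnable token is added to the visual features inside the region prompt and is also exposed on the text side to refer to that region. This is a minimal PyTorch sketch under stated assumptions; module names, shapes, and the injection rule are illustrative, not the paper's released implementation.

```python
# Minimal sketch of the Token Mark idea: one learnable token is injected into
# the visual features of a target region and the same embedding is appended to
# the text prompt, linking the region and its textual mention. Shapes and the
# injection rule are illustrative assumptions.
import torch
import torch.nn as nn

class TokenMark(nn.Module):
    def __init__(self, num_marks: int = 16, dim: int = 1024):
        super().__init__()
        # A fixed pool of region tokens shared across all frames of a video.
        self.marks = nn.Parameter(torch.randn(num_marks, dim) * 0.02)

    def inject(self, vis_feats: torch.Tensor, region_mask: torch.Tensor, mark_id: int):
        """vis_feats: (T, H*W, D) patch features; region_mask: (T, H*W) binary region prompt."""
        mark = self.marks[mark_id]                              # (D,)
        return vis_feats + region_mask.unsqueeze(-1) * mark     # add only inside the region

    def text_token(self, mark_id: int) -> torch.Tensor:
        """Embedding appended to the text prompt to refer to the same region."""
        return self.marks[mark_id]

# Toy usage with dummy tensors.
tm = TokenMark()
vis = torch.randn(8, 256, 1024)                  # 8 frames, 16x16 patches each
mask = torch.zeros(8, 256); mask[:, :32] = 1.0   # a toy box prompt
marked_vis = tm.inject(vis, mask, mark_id=3)
prompt_embed = tm.text_token(mark_id=3)          # concatenated with word embeddings downstream
```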
https://arxiv.org/abs/2501.08326
Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning. The dataset comprises 5,033 manually annotated, high-quality video clips containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which leverages a limited number of tokens to encode the main character's face. This model demonstrates superior performance in tracking faces and focusing on the facial expressions of the main characters, even in intricate multi-person scenarios. Additionally, we introduce a novel evaluation metric combining event extraction, relation classification, and the longest common subsequence (LCS) algorithm to assess the content consistency and temporal sequence consistency of generated text. Moreover, we present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task. All data and source code will be made publicly available.
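The LCS component of the proposed metric lends itself to a small illustration: given event sequences extracted from the reference and generated captions (the extraction and relation-classification steps are assumed to happen upstream), the longest common subsequence measures how well the generated ordering preserves the reference ordering. This sketch mirrors the spirit of the metric, not its exact formulation.

```python
# Sketch of an LCS-based temporal-order consistency score: the LCS length
# between reference and generated event sequences, normalized by the reference
# length. Event extraction is assumed to be done by an upstream model.
def lcs_length(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def temporal_consistency(ref_events, gen_events):
    if not ref_events:
        return 0.0
    return lcs_length(ref_events, gen_events) / len(ref_events)

# The generated caption swaps the first two expressions, so only 2 of 3 match in order.
print(temporal_consistency(["neutral", "smile", "laugh"], ["smile", "neutral", "laugh"]))  # ~0.67
```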
https://arxiv.org/abs/2501.07978
We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.
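The third upgrade applies Direct Preference Optimization (DPO) to model-sampled caption pairs. A minimal sketch of the standard DPO objective on whole-caption log-probabilities is shown below; the value of beta and the batching are illustrative assumptions, not Tarsier2's training configuration.

```python
# Sketch of the DPO objective used for preference training: given sequence
# log-probs of a preferred and a rejected caption under the policy and a frozen
# reference model, maximize the margin of the policy's log-ratio.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """All inputs: (B,) summed token log-probabilities of whole captions."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy check: the loss is smaller when the policy prefers the chosen caption.
lc, lr = torch.tensor([-10.0]), torch.tensor([-12.0])   # policy log-probs
rc, rr = torch.tensor([-11.0]), torch.tensor([-11.0])   # reference log-probs
print(dpo_loss(lc, lr, rc, rr).item())
```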
https://arxiv.org/abs/2501.07888
The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, they struggle with long-range dependencies due to quadratic computational costs, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: Temporal Mamba Block for sequential video processing and Vision-to-Audio Fusion Block for advanced audio-vision integration. Building on this, we develop the Multi-scale Temporal Encoder, aimed at enhancing the learning of visual features across scales, facilitating the perception of intra- and inter-frame information. To perform multi-modal fusion, we propose the Modality Aggregation Decoder, leveraging the Vision-to-Audio Fusion Block to integrate visual features into audio features across both frame and temporal levels. Further, we adopt the Contextual Integration Pyramid to perform audio-to-vision spatial-temporal context collaboration. Through these innovative contributions, our approach achieves new state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets. Our source code and model weights are available at AVS-Mamba.
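The vision-to-audio fusion step can be illustrated with a simplified stand-in: audio tokens gather information from visual tokens so that sound-emitting regions inform the audio stream. Note that this sketch uses standard cross-attention in place of the paper's selective state-space (Mamba) blocks, so it does not have the linear complexity the paper targets; shapes and module names are assumptions.

```python
# Simplified sketch of vision-to-audio fusion: audio features attend to visual
# features. Standard cross-attention is used here as a stand-in for the paper's
# Mamba-based fusion block; shapes are illustrative.
import torch
import torch.nn as nn

class VisionToAudioFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        """audio: (B, T, D) per-frame audio tokens; vision: (B, T*H*W, D) visual tokens."""
        fused, _ = self.attn(query=audio, key=vision, value=vision)
        return self.norm(audio + fused)   # residual keeps the original audio cue

block = VisionToAudioFusion()
out = block(torch.randn(2, 5, 256), torch.randn(2, 5 * 49, 256))
print(out.shape)   # torch.Size([2, 5, 256])
```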
https://arxiv.org/abs/2501.07810
Video causal reasoning aims to achieve a high-level understanding of videos from a causal perspective. However, it exhibits limitations in its scope, primarily executed in a question-answering paradigm and focusing on brief video segments containing isolated events and basic causal relations, lacking comprehensive and structured causality analysis for videos with multiple interconnected events. To fill this gap, we introduce a new task and dataset, Multi-Event Causal Discovery (MECD). It aims to uncover the causal relations between events distributed chronologically across long videos. Given visual segments and textual descriptions of events, MECD identifies the causal associations between these events to derive a comprehensive and structured event-level video causal graph explaining why and how the result event occurred. To address the challenges of MECD, we devise a novel framework inspired by the Granger Causality method, incorporating an efficient mask-based event prediction model to perform an Event Granger Test. It estimates causality by comparing the predicted result event when premise events are masked versus unmasked. Furthermore, we integrate causal inference techniques such as front-door adjustment and counterfactual inference to mitigate challenges in MECD like causality confounding and illusory causality. Additionally, context chain reasoning is introduced to conduct more robust and generalized reasoning. Experiments validate the effectiveness of our framework in reasoning complete causal relations, outperforming GPT-4o and VideoChat2 by 5.77% and 2.70%, respectively. Further experiments demonstrate that causal relation graphs can also contribute to downstream video understanding tasks such as video question answering and video event prediction.
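The Event Granger Test logic admits a compact sketch: a premise event is taken as causal for the result event if masking it noticeably degrades the prediction of that result. The `predict_result_logprob` function below is a hypothetical placeholder for the paper's mask-based event prediction model, and the threshold rule is an assumption.

```python
# Sketch of the Event Granger Test: compare the predicted result event when a
# premise event is masked versus unmasked; a large drop suggests a causal link.
# `predict_result_logprob` is a hypothetical stand-in for the prediction model.
def event_granger_test(events, result, predict_result_logprob, threshold=0.5):
    causes = []
    full_score = predict_result_logprob(events, result)         # all premises visible
    for i, ev in enumerate(events):
        masked = events[:i] + ["[MASK]"] + events[i + 1:]
        masked_score = predict_result_logprob(masked, result)   # premise i hidden
        if full_score - masked_score > threshold:               # large drop => causal
            causes.append(ev)
    return causes

# Toy stand-in scorer: the result is easier to predict when "dog runs" is visible.
toy = lambda evs, res: 0.0 if "dog runs" in evs else -1.0
print(event_granger_test(["dog runs", "man eats"], "man chases dog", toy))  # ['dog runs']
```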
https://arxiv.org/abs/2501.07227
With the rapid development of multimedia processing and deep learning technologies, especially in the field of video understanding, video quality assessment (VQA) has achieved significant progress. Although researchers have moved from designing efficient video quality mapping models to various research directions, in-depth exploration of the effectiveness-efficiency trade-offs of spatio-temporal modeling in VQA models remains insufficient. Considering the fact that videos contain highly redundant information, this paper investigates this problem from the perspective of joint spatial and temporal sampling, aiming to determine how little information we need to retain when feeding videos into VQA models while incurring only an acceptable sacrifice in performance. To this end, we drastically sample the video's information from both spatial and temporal dimensions, and the heavily squeezed video is then fed into a stable VQA model. Comprehensive experiments on joint spatial and temporal sampling are conducted on six public video quality databases, and the results demonstrate that the VQA model maintains acceptable performance even when most of the video information is discarded. Furthermore, with the proposed joint spatial and temporal sampling strategy, we make an initial attempt to design an online VQA model, instantiated with a spatial feature extractor, a temporal feature fusion module, and a global quality regression module that are kept as simple as possible. Through quantitative and qualitative experiments, we verify the feasibility of the online VQA model by simplifying the model and reducing its input.
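The joint sampling step itself is simple to illustrate: keep only a small fraction of frames and aggressively downsample each kept frame before feeding a VQA model. This is a minimal sketch; the specific sampling ratios are illustrative assumptions, whereas the paper sweeps over them.

```python
# Minimal sketch of joint spatio-temporal sampling: subsample frames and
# downscale resolution to produce a heavily squeezed clip for a VQA model.
import torch
import torch.nn.functional as F

def joint_sample(video: torch.Tensor, keep_every_n: int = 8, spatial_scale: float = 0.25):
    """video: (T, C, H, W) float tensor -> heavily squeezed clip."""
    clip = video[::keep_every_n]                                   # temporal sampling
    clip = F.interpolate(clip, scale_factor=spatial_scale,
                         mode="bilinear", align_corners=False)     # spatial sampling
    return clip

video = torch.rand(240, 3, 540, 960)    # a 10 s clip at 24 fps
squeezed = joint_sample(video)
print(squeezed.shape)                   # torch.Size([30, 3, 135, 240])
```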
https://arxiv.org/abs/2501.07087
Long-form egocentric video understanding provides rich contextual information and unique insights into long-term human behaviors, holding significant potential for applications in embodied intelligence, long-term activity analysis, and personalized assistive technologies. However, existing benchmark datasets primarily focus on single, short-duration videos or moderately long videos up to dozens of minutes, leaving a substantial gap in evaluating extensive, ultra-long egocentric video recordings. To address this, we introduce X-LeBench, a novel benchmark dataset specifically crafted for evaluating tasks on extremely long egocentric video recordings. Leveraging the advanced text processing capabilities of large language models (LLMs), X-LeBench develops a life-logging simulation pipeline that produces realistic, coherent daily plans aligned with real-world video data. This approach enables the flexible integration of synthetic daily plans with real-world footage from Ego4D, a massive-scale egocentric video dataset covering a wide range of daily life scenarios, resulting in 432 simulated video life logs that mirror realistic daily activities in contextually rich scenarios. The video life-log durations span from 23 minutes to 16.4 hours. The evaluation of several baseline systems and multimodal large language models (MLLMs) reveals their poor performance across the board, highlighting the inherent challenges of long-form egocentric video understanding and underscoring the need for more advanced models.
https://arxiv.org/abs/2501.06835
Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with fine-grained temporal understanding, such as Dense Video Captioning (DVC). DVC is a complicated task of describing all events within a video while also temporally localizing them, which integrates multiple fine-grained tasks, including video segmentation, video captioning, and temporal video grounding. Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability. Moreover, previous training objectives for VideoLLMs do not fully reflect the evaluation metrics, therefore not providing supervision directly aligned to target tasks. To address such a problem, we propose a novel framework named VidChain comprised of Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO). CoTasks decompose a complex task into a sequence of sub-tasks, allowing VideoLLMs to leverage their reasoning capabilities more effectively. M-DPO aligns a VideoLLM with evaluation metrics, providing fine-grained supervision to each task that is well-aligned with metrics. Applied to two different VideoLLMs, VidChain consistently improves their fine-grained video understanding, thereby outperforming previous VideoLLMs on two different DVC benchmarks and also on the temporal video grounding task. Code is available at this https URL.
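The Chain-of-Tasks idea can be sketched as prompt chaining: rather than asking for dense captions in one shot, the VideoLLM is queried with a sequence of sub-task prompts whose outputs feed the next step. The `ask_videollm` interface and the exact sub-task ordering below are assumptions based only on the abstract.

```python
# Sketch of a Chain-of-Tasks decomposition for dense video captioning:
# segmentation -> captioning -> grounding, each as a separate query.
# `ask_videollm` is a hypothetical model interface.
def dense_video_captioning_cotasks(video, ask_videollm):
    # 1) temporal segmentation: propose event boundaries
    boundaries = ask_videollm(video, "List the start and end times of each distinct event.")
    captions = []
    for (start, end) in boundaries:
        # 2) captioning: describe each proposed segment
        text = ask_videollm(video, f"Describe what happens between {start}s and {end}s.")
        # 3) grounding: let the model refine the segment for its own caption
        refined = ask_videollm(video, f"When exactly does this happen: '{text}'?")
        captions.append({"segment": refined, "caption": text})
    return captions

# Toy stub so the sketch runs end-to-end.
def toy_llm(video, prompt):
    if "start and end" in prompt:
        return [(0.0, 3.5), (3.5, 9.0)]
    if "When exactly" in prompt:
        return (0.0, 3.0)
    return "a person opens a door"

print(dense_video_captioning_cotasks("video.mp4", toy_llm))
```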
https://arxiv.org/abs/2501.06761
Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at this https URL.
https://arxiv.org/abs/2501.05901
The recent widespread adoption of drones for studying marine animals provides opportunities for deriving biological information from aerial imagery. The large scale of imagery data acquired from drones is well suited for machine learning (ML) analysis. Development of ML models for analyzing marine animal aerial imagery has followed the classical paradigm of training, testing, and deploying a new model for each dataset, requiring significant time, human effort, and ML expertise. We introduce Frame Level ALIgnment and tRacking (FLAIR), which leverages the video understanding of Segment Anything Model 2 (SAM2) and the vision-language capabilities of Contrastive Language-Image Pre-training (CLIP). FLAIR takes a drone video as input and outputs segmentation masks of the species of interest across the video. Notably, FLAIR leverages a zero-shot approach, eliminating the need for labeled data, training a new model, or fine-tuning an existing model to generalize to other species. With a dataset of 18,000 drone images of Pacific nurse sharks, we trained state-of-the-art object detection models to compare against FLAIR. We show that FLAIR massively outperforms these object detectors and performs competitively against two human-in-the-loop methods for prompting SAM2, achieving a Dice score of 0.81. FLAIR readily generalizes to other shark species without additional human effort and can be combined with novel heuristics to automatically extract relevant information including length and tailbeat frequency. FLAIR has significant potential to accelerate aerial imagery analysis workflows, requiring markedly less human effort and expertise than traditional machine learning workflows, while achieving superior accuracy. By reducing the effort required for aerial imagery analysis, FLAIR allows scientists to spend more time interpreting results and deriving insights about marine ecosystems.
https://arxiv.org/abs/2501.05717
Large Vision Language Models (LVLMs) have demonstrated impressive capabilities in video understanding, yet their adoption for Activities of Daily Living (ADL) remains limited by their inability to capture fine-grained interactions and spatial relationships. This limitation is particularly evident in ADL tasks, where understanding detailed human-object interaction and human-centric motion is crucial for applications such as elderly monitoring and cognitive assessment. To address this, we aim to leverage the complementary nature of egocentric views to enhance LVLM's understanding of exocentric ADL videos. Consequently, we propose an online ego2exo distillation approach to learn ego-augmented exo representations in LVLMs. While effective, this approach requires paired ego-exo training data, which is impractical to collect for real-world ADL scenarios. Consequently, we develop EgoMimic, a skeleton-guided method that can generate mimicked ego views from exocentric videos. We find that the exo representations of our ego-augmented LVLMs successfully learn to extract ego-perspective cues, demonstrated through comprehensive evaluation on six ADL benchmarks and our proposed EgoPerceptionMCQ benchmark designed specifically to assess egocentric understanding from exocentric videos. Code, models, and data will be open-sourced at this https URL.
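The online ego2exo distillation can be pictured as a feature-matching objective: the exo-view student representation is pushed toward the paired ego-view teacher representation. The projection head and the cosine loss below are assumptions; the paper's online distillation may differ in detail.

```python
# Sketch of ego-to-exo feature distillation: exo-view features are aligned to
# paired ego-view features with a cosine feature-matching loss. The projection
# and loss choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Ego2ExoDistiller(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # maps exo features into the ego space

    def forward(self, exo_feats: torch.Tensor, ego_feats: torch.Tensor) -> torch.Tensor:
        """exo_feats, ego_feats: (B, T, D) pooled clip features from the two views."""
        student = F.normalize(self.proj(exo_feats), dim=-1)
        teacher = F.normalize(ego_feats.detach(), dim=-1)   # ego branch acts as teacher
        return (1.0 - (student * teacher).sum(dim=-1)).mean()

distiller = Ego2ExoDistiller()
loss = distiller(torch.randn(4, 16, 768), torch.randn(4, 16, 768))
print(loss.item())
```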
https://arxiv.org/abs/2501.05711
Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at this https URL.
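Querying along the timeline can be sketched as a streaming evaluation loop: for each annotated question, the model only sees frames up to the question's timestamp (backward tracing / real-time), or up to a later "answerable" timestamp for forward active responding. The model interface and annotation fields below are assumptions, not the benchmark's actual schema.

```python
# Sketch of timestamp-conditioned querying: no future frames beyond the visible
# timestamp are ever shown to the model. `videollm` is a hypothetical interface.
def evaluate_streaming(frames, fps, annotations, videollm):
    results = []
    for ann in annotations:
        t_query = ann["timestamp"]
        # forward-active items may only become answerable after extra context arrives
        t_visible = ann.get("answerable_at", t_query)
        visible = frames[: int(t_visible * fps)]
        pred = videollm(visible, ann["question"])
        results.append({"id": ann["id"], "pred": pred, "gt": ann["answer"]})
    accuracy = sum(r["pred"] == r["gt"] for r in results) / max(len(results), 1)
    return results, accuracy

# Toy run with a dummy model that always answers "yes".
frames = list(range(300))   # 10 s of video at 30 fps
anns = [{"id": 0, "timestamp": 4.0, "question": "Is the door open?", "answer": "yes"}]
print(evaluate_streaming(frames, 30, anns, lambda f, q: "yes")[1])   # 1.0
```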
https://arxiv.org/abs/2501.05510
In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary strengths of each projector. We observe that different visual projectors exhibit distinct characteristics when handling specific tasks. For instance, some projectors excel at capturing static details, while others are more effective at processing temporal information, and some are better suited for tasks requiring temporal coherence. By dynamically adjusting feature weights according to user instructions, LLaVA-Octopus selects and combines the most suitable features, significantly enhancing the model's performance in multimodal tasks. Experimental results demonstrate that LLaVA-Octopus achieves excellent performance across multiple benchmarks, especially in tasks such as multimodal understanding, visual question answering, and video understanding, highlighting its broad application potential.
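The instruction-conditioned weighting can be sketched as a soft gate: an instruction embedding produces softmax weights over the outputs of several projectors, which are blended into a single visual token stream. Projector internals, gating design, and dimensions below are illustrative assumptions.

```python
# Sketch of instruction-conditioned projector weighting: a linear gate over the
# instruction embedding yields per-projector softmax weights used to blend
# projector outputs. Dimensions are illustrative.
import torch
import torch.nn as nn

class ProjectorGate(nn.Module):
    def __init__(self, num_projectors: int = 3, text_dim: int = 4096):
        super().__init__()
        self.gate = nn.Linear(text_dim, num_projectors)

    def forward(self, projected_feats: torch.Tensor, instruction_embed: torch.Tensor):
        """projected_feats: (P, B, N, D) outputs of P projectors; instruction_embed: (B, text_dim)."""
        weights = torch.softmax(self.gate(instruction_embed), dim=-1)   # (B, P)
        weights = weights.t()[:, :, None, None]                         # (P, B, 1, 1)
        return (weights * projected_feats).sum(dim=0)                   # (B, N, D)

gate = ProjectorGate()
feats = torch.randn(3, 2, 576, 4096)    # 3 projectors, batch 2, 576 visual tokens
instr = torch.randn(2, 4096)
print(gate(feats, instr).shape)         # torch.Size([2, 576, 4096])
```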
https://arxiv.org/abs/2501.05067
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos), automatically generated dataset for long-form video understanding. We developed a systematic approach that organizes videos into a hierarchical tree structure and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.); and 3) explicit timestamp labels for relevant events. LongViTU also serves as a benchmark for instruction following in long-form and streaming video understanding. We evaluate the open-source state-of-the-art long video understanding model, LongVU, and the commercial model, Gemini-1.5-Pro, on our benchmark. They achieve GPT-4 scores of 49.9 and 52.3, respectively, underscoring the substantial challenge posed by our benchmark. Further supervised fine-tuning (SFT) on LongVU led to performance improvements of 12.0% on our benchmark, 2.2% on the in-distribution (ID) benchmark EgoSchema, 1.0%, 2.2% and 1.2% on the out-of-distribution (OOD) benchmarks VideoMME (Long), WorldQA and OpenEQA, respectively. These outcomes demonstrate LongViTU's high data quality and robust OOD generalizability.
https://arxiv.org/abs/2501.05037
Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
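The "mind palace" memory can be pictured as a small semantic graph whose nodes are activity zones and tracked objects, whose edges carry timestamped hand-object interactions, and which is serialized to text so an LLM can reason over it. The field names and serialization below are illustrative assumptions, not the paper's schema.

```python
# Sketch of a topologically structured video memory: activity zones, objects,
# and timestamped hand-object interactions, serialized to text for an LLM.
import json

graph = {
    "zones": [
        {"id": "kitchen_counter", "neighbors": ["sink"]},
        {"id": "sink", "neighbors": ["kitchen_counter"]},
    ],
    "objects": [{"id": "mug", "last_seen_zone": "sink"}],
    "interactions": [
        {"t": 12.4, "hand": "right", "object": "mug", "zone": "kitchen_counter", "action": "picks up"},
        {"t": 31.0, "hand": "right", "object": "mug", "zone": "sink", "action": "puts down"},
    ],
}

def serialize_for_llm(g):
    lines = [f"At {e['t']:.1f}s the {e['hand']} hand {e['action']} the {e['object']} near the {e['zone']}."
             for e in g["interactions"]]
    return "\n".join(lines)

print(serialize_for_llm(graph))
print(json.dumps(graph["objects"]))   # e.g. answer "where is the mug?" from the graph state
```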
https://arxiv.org/abs/2501.04336
With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical to interactively analyze what will happen in the procedure of autonomous driving. However, videos of such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules, including Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structure state space models, which can effectively capture multi-granularity video context for different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query, and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate all the video contexts of multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving; e.g., for risk object detection, it outperforms the previous SOTA method by a 5.5% mIoU improvement.
https://arxiv.org/abs/2501.04302
In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance the VLM's ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: this https URL.
https://arxiv.org/abs/2501.02955
Most current captioning systems use language models trained on data from specific settings, such as image-based captioning via Amazon Mechanical Turk, limiting their ability to generalize to other modality distributions and contexts. This limitation hinders performance in tasks like audio or video captioning, where different semantic cues are needed. Addressing this challenge is crucial for creating more adaptable and versatile captioning frameworks applicable across diverse real-world contexts. In this work, we introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning, where it is crucial to describe sounds and their sources. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. The classifier is trained on a dataset automatically generated by GPT-4, using tailored prompts specifically designed to enhance key aspects of the generated captions. Importantly, the framework operates solely during inference, eliminating the need for further training of the underlying captioning model. We evaluate the framework on various models and modalities, with a focus on audio captioning, and report promising results. Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
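The inference-time guidance can be sketched as a re-ranking step: candidate captions from the frozen captioner are rescored by mixing the captioner's own score with a text classifier's probability that the caption has the desired property (e.g., describes audible events). The interfaces and the linear mixing rule below are assumptions, not the paper's exact guidance mechanism.

```python
# Sketch of inference-time classifier guidance via re-ranking: no retraining of
# the underlying captioner is needed. Interfaces and mixing rule are assumed.
def guided_rerank(candidates, classifier_prob, alpha=0.5):
    """candidates: list of (caption, lm_logprob); classifier_prob: caption -> [0, 1]."""
    scored = [
        (alpha * lm_logprob + (1 - alpha) * classifier_prob(text), text)
        for text, lm_logprob in candidates
    ]
    return max(scored)[1]

cands = [("a busy street", -1.2), ("cars honking on a busy street", -1.5)]
audibility = lambda t: 1.0 if "honking" in t else 0.1   # toy stand-in classifier
print(guided_rerank(cands, audibility))                  # prefers the audible description
```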
https://arxiv.org/abs/2501.03183
Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) lack of large-scale benchmark datasets. Among them, in this paper, we focus on building a large-scale hour-long long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multiple-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.
https://arxiv.org/abs/2501.01645
The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite video understanding benchmarks, while also having a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods while parallelizing the visual processing. Finally, we provide qualitative evidence that MERV successfully captures domain knowledge from each of its encoders. Our results offer promising directions in utilizing multiple vision encoders for comprehensive video understanding.
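The multi-encoder fusion can be sketched as resampling each frozen encoder's features to a shared spatio-temporal grid, projecting them to a common width, and combining them (here by simple averaging). The target grid, projection heads, and fusion rule are illustrative assumptions rather than MERV's exact recipe.

```python
# Sketch of multi-encoder video fusion: align each encoder's (T_i, L_i, D_i)
# features to a shared (T, L) grid, project to a common width, and average.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiEncoderFusion(nn.Module):
    def __init__(self, in_dims, out_dim=4096, target_t=16, target_l=256):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d, out_dim) for d in in_dims])
        self.target = (target_t, target_l)

    def forward(self, feats):
        """feats: list of (B, T_i, L_i, D_i) tensors from different frozen encoders."""
        aligned = []
        for f, proj in zip(feats, self.projs):
            f = f.permute(0, 3, 1, 2)                                   # (B, D, T, L)
            f = F.interpolate(f, size=self.target, mode="bilinear",
                              align_corners=False)                      # shared grid
            aligned.append(proj(f.permute(0, 2, 3, 1)))                 # (B, T, L, out)
        return torch.stack(aligned).mean(dim=0)                         # fuse encoders

fusion = MultiEncoderFusion(in_dims=[768, 1024])
out = fusion([torch.randn(1, 8, 196, 768), torch.randn(1, 32, 256, 1024)])
print(out.shape)   # torch.Size([1, 16, 256, 4096])
```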
https://arxiv.org/abs/2501.01426