Egocentric videos provide a unique perspective into individuals' daily experiences, yet their unstructured nature presents challenges for perception. In this paper, we introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos. Inspired by humans' ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from one egocentric video, capturing key locations and object interactions. This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content. Additionally, to evaluate our understanding of very-long egocentric videos, we introduce the new Active Memories Benchmark (AMB), composed of more than 20K highly challenging visual queries from EPIC-KITCHENS. These queries cover different levels of video reasoning (sequencing, concurrency and temporal grounding) to assess detailed video understanding capabilities. We showcase improved performance of AMEGO on AMB, surpassing other video QA baselines by a substantial margin.
https://arxiv.org/abs/2409.10917
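To make the idea of a query-without-reprocessing memory concrete, here is a minimal, hypothetical sketch in Python of an AMEGO-style structure over key locations and object interactions; the class and field names are illustrative assumptions, not the paper's actual data model.

```python
# Hypothetical sketch of an "active memory": a semantic-free, self-contained record of key
# locations and object interactions that can answer queries without re-reading the video.
from dataclasses import dataclass, field

@dataclass
class ObjectInteraction:
    track_id: int          # identity of the interacted object (no semantic label required)
    location_id: int       # key location where the interaction happened
    start_frame: int
    end_frame: int

@dataclass
class ActiveMemory:
    interactions: list = field(default_factory=list)

    def add(self, interaction: ObjectInteraction) -> None:
        self.interactions.append(interaction)

    def interactions_at(self, location_id: int) -> list:
        """All interactions observed at one key location (concurrency-style query)."""
        return [i for i in self.interactions if i.location_id == location_id]

    def order_of(self, track_id: int) -> list:
        """Interactions with one object, in temporal order (sequencing-style query)."""
        hits = [i for i in self.interactions if i.track_id == track_id]
        return sorted(hits, key=lambda i: i.start_frame)

# Usage: build the memory once during a single pass over the video, then query it repeatedly.
memory = ActiveMemory()
memory.add(ObjectInteraction(track_id=3, location_id=0, start_frame=120, end_frame=240))
memory.add(ObjectInteraction(track_id=3, location_id=1, start_frame=900, end_frame=960))
print(memory.order_of(3))
```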
Video annotation is a critical and time-consuming task in computer vision research and applications. This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process. Our approach uses Hierarchical Stochastic Neighbor Embedding (HSNE) to create a multi-scale representation of video features, allowing annotators to efficiently explore and label large video datasets. We demonstrate significant reductions in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in clicks required for annotating over 12 hours of video. Our experiments on multiple datasets show the effectiveness and robustness of our pipeline across various scenarios. Moreover, we investigate the optimal configuration of HSNE parameters for different datasets. Our work provides a promising direction for scaling up video annotation efforts in the era of video understanding.
https://arxiv.org/abs/2409.10641
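As a rough illustration of the pipeline idea, the sketch below clusters pre-extracted frame features so that a single annotator interaction labels a whole group of frames; plain KMeans stands in for HSNE here, and the feature shapes and cluster counts are assumptions rather than the authors' implementation.

```python
# Rough sketch only: coarse clustering of pre-extracted per-frame features lets one
# annotator "click" label many frames at once (KMeans is a stand-in for HSNE).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 512))      # pre-extracted per-frame features (assumed shape)

labels = np.full(len(features), -1)          # -1 means "not yet annotated"
coarse = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)

def annotate_cluster(cluster_id: int, class_id: int) -> int:
    """One annotator interaction labels every frame assigned to a coarse cluster."""
    mask = coarse == cluster_id
    labels[mask] = class_id
    return int(mask.sum())

n_labeled = annotate_cluster(cluster_id=2, class_id=1)
print(f"labeled {n_labeled} frames with a single interaction")
```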
The SoccerNet 2024 challenges represent the fourth annual video understanding challenges organized by the SoccerNet team. These challenges aim to advance research across multiple themes in football, including broadcast video understanding, field understanding, and player understanding. This year, the challenges encompass four vision-based tasks: (1) Ball Action Spotting, focusing on precisely localizing when and which soccer actions related to the ball occur; (2) Dense Video Captioning, focusing on describing the broadcast with natural language and anchored timestamps; (3) Multi-View Foul Recognition, a novel task focusing on analyzing multiple viewpoints of a potential foul incident to classify whether a foul occurred and assess its severity; (4) Game State Reconstruction, another novel task focusing on reconstructing the game state from broadcast videos onto a 2D top-view map of the field. Detailed information about the tasks, challenges, and leaderboards can be found at this https URL, with baselines and development kits available at this https URL.
https://arxiv.org/abs/2409.10587
Recently, integrating visual foundation models into large language models (LLMs) to form video understanding systems has attracted widespread attention. Most existing models compress diverse semantic information within the whole video and feed it into LLMs for content comprehension. While this method excels in short video understanding, it may blend information from multiple events in long videos due to coarse compression, which causes information redundancy. Consequently, the semantics of key events might be obscured within the vast information, which hinders the model's understanding capabilities. To address this issue, we propose a Hierarchical Event-based Memory-enhanced LLM (HEM-LLM) for better understanding of long videos. First, we design a novel adaptive sequence segmentation scheme to divide long videos into multiple events. In this way, we can perform individual memory modeling for each event to establish intra-event contextual connections, thereby reducing information redundancy. Second, while modeling the current event, we compress and inject the information of the previous event to enhance long-term inter-event dependencies in videos. Finally, we perform extensive experiments on various video understanding tasks, and the results show that our model achieves state-of-the-art performance.
https://arxiv.org/abs/2409.06299
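The following sketch illustrates, under stated assumptions, the two mechanisms the abstract describes: similarity-based adaptive event segmentation and injecting a compressed memory of the previous event while modeling the current one. The threshold, mean-pooling compression, and feature shapes are placeholders, not HEM-LLM's actual design.

```python
# Illustrative only: split a long video into events where consecutive frame features
# diverge, then carry a compressed memory of the previous event into the current one.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def segment_events(frame_feats, threshold=0.75):
    """Start a new event whenever similarity to the previous frame drops below threshold."""
    boundaries = [0]
    for t in range(1, len(frame_feats)):
        if cosine(frame_feats[t - 1], frame_feats[t]) < threshold:
            boundaries.append(t)
    boundaries.append(len(frame_feats))
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 256))           # per-frame features (assumed shape)
prev_memory = np.zeros(256)                   # compressed memory of the previous event
for start, end in segment_events(feats):
    event = feats[start:end]
    # Inject the previous event's compressed memory before modeling the current event.
    event_input = np.concatenate([prev_memory[None, :], event], axis=0)
    # ... event_input would be fed to the per-event memory model (omitted here)
    prev_memory = event.mean(axis=0)          # mean pooling as a stand-in for compression
```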
We introduce VidLPRO, a novel video-language (VL) pre-training framework designed specifically for robotic and laparoscopic surgery. While existing surgical VL models primarily rely on contrastive learning, we propose a more comprehensive approach to capture the intricate temporal dynamics and align video with language. VidLPRO integrates video-text contrastive learning, video-text matching, and masked language modeling objectives to learn rich VL representations. To support this framework, we present GenSurg+, a carefully curated dataset derived from GenSurgery, comprising 17k surgical video clips paired with captions generated by GPT-4 using transcripts extracted by the Whisper model. This dataset addresses the need for large-scale, high-quality VL data in the surgical domain. Extensive experiments on benchmark datasets, including Cholec80 and AutoLaparo, demonstrate the efficacy of our approach. VidLPRO achieves state-of-the-art performance in zero-shot surgical phase recognition, significantly outperforming existing surgical VL models such as SurgVLP and HecVL. Our model demonstrates improvements of up to 21.5% in accuracy and 15.7% in F1 score, setting a new benchmark in the field. Notably, VidLPRO exhibits robust performance even with single-frame inference, while effectively scaling with increased temporal context. Ablation studies reveal the impact of frame sampling strategies on model performance and computational efficiency. These results underscore VidLPRO's potential as a foundation model for surgical video understanding.
https://arxiv.org/abs/2409.04732
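A minimal sketch of how the three pre-training objectives named above (video-text contrastive learning, video-text matching, and masked language modeling) could be combined into a single loss; the tensor shapes, equal loss weighting, and helper names are assumptions rather than VidLPRO's released code.

```python
# Sketch of a combined VL pre-training loss under stated assumptions.
import torch
import torch.nn.functional as F

def combined_vl_loss(video_emb, text_emb, match_logits, match_labels,
                     mlm_logits, mlm_labels, temperature=0.07):
    # Video-text contrastive learning: symmetric InfoNCE over the batch.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = video_emb @ text_emb.t() / temperature
    targets = torch.arange(sims.size(0), device=sims.device)
    contrastive = 0.5 * (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets))

    # Video-text matching: binary classification (1 = paired clip/caption, 0 = mismatched).
    matching = F.binary_cross_entropy_with_logits(match_logits, match_labels.float())

    # Masked language modeling over caption tokens (-100 marks unmasked positions).
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    return contrastive + matching + mlm
```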
Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
https://arxiv.org/abs/2409.03206
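The frame-wise block causal attention mask can be pictured as full attention within a frame and causal attention across frames. The sketch below builds such a boolean mask for visual tokens; how TC-LLaVA interleaves text tokens is not shown, and the function itself is an illustrative assumption.

```python
# Illustrative mask: a visual token may attend to every token in its own frame and in
# earlier frames, but not to later frames.
import torch

def frame_block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    n = num_frames * tokens_per_frame
    frame_idx = torch.arange(n) // tokens_per_frame          # frame id of each token
    # allowed[i, j] is True when token j belongs to the same or an earlier frame than token i.
    allowed = frame_idx.unsqueeze(1) >= frame_idx.unsqueeze(0)
    return allowed  # boolean mask; convert to an additive -inf mask as needed

mask = frame_block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())
# Tokens of frame 0 see only frame 0; tokens of frame 2 see frames 0-2,
# including the other tokens within their own frame.
```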
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
https://arxiv.org/abs/2409.02889
Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5-point improvement over its competitors across three VideoQA benchmarks and a 2.06-point improvement on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, it maintains performance as robust as PLLaVA's even as video length increases up to 8 times. Besides, the frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without necessitating additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness, thereby setting a new foundation for long-form video-language models in both academic and practical applications.
https://arxiv.org/abs/2409.01071
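A hedged sketch of the recurrent-memory-token idea in a bridge layer: a fixed set of memory tokens cross-attends to each semantic segment in turn and is carried forward, so later segments are encoded with a summary of earlier ones. The module name, shapes, and update rule are assumptions, not VideoLLaMB's implementation.

```python
# Illustrative recurrent memory tokens over temporally ordered video segments.
import torch
import torch.nn as nn

class RecurrentMemoryBridge(nn.Module):
    def __init__(self, dim=768, num_memory=16, num_heads=8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(1, num_memory, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, segments):
        """segments: list of (batch, seg_len, dim) visual-token tensors, in temporal order."""
        mem = self.memory.expand(segments[0].size(0), -1, -1)
        outputs = []
        for seg in segments:
            # Memory tokens read the current segment together with the running memory.
            context = torch.cat([mem, seg], dim=1)
            mem, _ = self.attn(query=mem, key=context, value=context)
            outputs.append(torch.cat([mem, seg], dim=1))  # pass memory + segment onward
        return outputs

bridge = RecurrentMemoryBridge()
segs = [torch.randn(2, 40, 768) for _ in range(3)]
outs = bridge(segs)
```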
In recent years, unmanned aerial vehicles (UAVs) have played an increasingly crucial role in supporting disaster emergency response efforts by analyzing aerial images. While current deep-learning models focus on improving accuracy, they often overlook the limited computing resources of UAVs. This study recognizes the imperative for real-time data processing in disaster response scenarios and introduces a lightweight and efficient approach for aerial video understanding. Our methodology identifies redundant portions within the video through policy networks and eliminates this excess information using frame compression techniques. Additionally, we introduced the concept of a 'station point,' which leverages future information in the sequential policy network, thereby enhancing accuracy. To validate our method, we employed the wildfire FLAME dataset. Compared to the baseline, our approach reduces computation costs by more than 13 times while boosting accuracy by 3%. Moreover, our method can intelligently select salient frames from the video, refining the dataset. This feature enables sophisticated models to be effectively trained on a smaller dataset, significantly reducing the time spent during the training process.
https://arxiv.org/abs/2409.00510
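As an illustration only, the sketch below scores frames, skips redundant stretches, and peeks a few frames ahead before committing, loosely mirroring the policy-network selection and the 'station point' use of future information; the scoring, threshold, and lookahead are hypothetical choices, not the paper's method.

```python
# Hypothetical frame-selection loop with a short lookahead window.
import numpy as np

def select_frames(frame_scores, keep_threshold=0.5, lookahead=3):
    kept = []
    t = 0
    while t < len(frame_scores):
        window = frame_scores[t:t + 1 + lookahead]      # current frame plus a peek ahead
        if window.max() >= keep_threshold:
            kept.append(t + int(window.argmax()))       # jump to the most salient frame
            t = kept[-1] + 1
        else:
            t += 1 + lookahead                          # compress/skip a redundant stretch
    return kept

rng = np.random.default_rng(0)
scores = rng.random(120)                                # per-frame saliency (assumed)
print(select_frames(scores)[:10])
```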
Predicting and reasoning about how a video would make a human feel is crucial for developing socially intelligent systems. Although Multimodal Large Language Models (MLLMs) have shown impressive video understanding capabilities, they tend to focus more on the semantic content of videos, often overlooking emotional stimuli. Hence, most existing MLLMs fall short in estimating viewers' emotional reactions and providing plausible explanations. To address this issue, we propose StimuVAR, a spatiotemporal Stimuli-aware framework for Video Affective Reasoning (VAR) with MLLMs. StimuVAR incorporates a two-level stimuli-aware mechanism: frame-level awareness and token-level awareness. Frame-level awareness involves sampling video frames with events that are most likely to evoke viewers' emotions. Token-level awareness performs tube selection in the token space to make the MLLM concentrate on emotion-triggered spatiotemporal regions. Furthermore, we create VAR instruction data to perform affective training, steering MLLMs' reasoning strengths towards emotional focus and thereby enhancing their affective reasoning ability. To thoroughly assess the effectiveness of VAR, we provide a comprehensive evaluation protocol with extensive metrics. StimuVAR is the first MLLM-based method for viewer-centered VAR. Experiments demonstrate its superiority in understanding viewers' emotional responses to videos and providing coherent and insightful explanations.
https://arxiv.org/abs/2409.00304
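The token-level awareness step can be pictured as keeping only the most salient spatiotemporal tubes. Below is a hedged sketch in which tube saliency is approximated by feature magnitude; StimuVAR's actual emotion-driven scoring and tube definition may differ.

```python
# Illustrative top-k tube selection over a (T, H, W, D) grid of visual tokens.
import torch

def select_tubes(tokens: torch.Tensor, k: int = 16) -> torch.Tensor:
    """tokens: (T, H, W, D) visual tokens; returns (T, k, D) tokens from the top-k tubes."""
    T, H, W, D = tokens.shape
    flat = tokens.reshape(T, H * W, D)
    tube_scores = flat.norm(dim=-1).mean(dim=0)       # one score per spatial tube (placeholder)
    top = torch.topk(tube_scores, k).indices          # indices of the k most salient tubes
    return flat[:, top, :]

tokens = torch.randn(8, 14, 14, 768)
print(select_tubes(tokens).shape)                     # torch.Size([8, 16, 768])
```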
While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: this https URL.
https://arxiv.org/abs/2408.17443
Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344 × 1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced in this https URL and this https URL, contributing to the advancement of the field.
https://arxiv.org/abs/2408.16500
In recent years, the backbones used for video understanding tasks have continued to grow, with parameter counts reaching the billion scale. Whether fine-tuning a video foundation model on a specific task or pre-training a model designed for that task, the overhead is substantial. How to make these models provide value beyond their own tasks therefore becomes a worthwhile question. Multi-Task Learning (MTL) lets a visual task acquire rich shareable knowledge from other tasks through joint training. It has been thoroughly explored for image recognition, especially dense prediction tasks. Nevertheless, it is rarely used in the video domain due to the lack of multi-label video data. In this paper, a heterogeneous-data video multi-task prompt learning (VMTL) method is proposed to address this problem. Unlike its counterpart in the image domain, a Double-Layers Mapper (DLM) is proposed to extract shareable knowledge into visual prompts and align it with the representation of the primary task. Extensive experiments prove that our DLM-VMTL performs better than baselines on 6 different video understanding tasks and 11 datasets.
https://arxiv.org/abs/2408.16195
Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale, high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. In particular, on benchmarks specialized for long videos, Kangaroo surpasses some larger models with over 10B parameters as well as proprietary models.
https://arxiv.org/abs/2408.15542
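A purely illustrative curriculum configuration in the spirit of the training pipeline described above, with resolution and frame count increasing by stage; the specific values and data mixes are assumptions, not Kangaroo's published recipe.

```python
# Hypothetical curriculum schedule: later stages see higher resolution and more frames.
curriculum = [
    {"stage": 1, "resolution": 224, "num_frames": 8,  "data": "image-text pre-training"},
    {"stage": 2, "resolution": 336, "num_frames": 16, "data": "short-video instruction data"},
    {"stage": 3, "resolution": 448, "num_frames": 64, "data": "long-video instruction data"},
]

for cfg in curriculum:
    print(f"stage {cfg['stage']}: {cfg['resolution']}px, {cfg['num_frames']} frames, {cfg['data']}")
```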
This paper proposes a method for video captioning that controls the length of generated captions. Previous work on length control often offered only a few levels for expressing length. In this study, we propose two length-embedding methods for fine-grained length control. A traditional embedding method is linear, using a one-hot vector and an embedding matrix. Here, we propose methods that represent length as multi-hot vectors: one is bit embedding, which expresses length in a bit representation, and the other is ordinal embedding, which uses the binary representation often used in ordinal regression. These multi-hot length representations are converted into length embeddings by a nonlinear MLP. This approach allows control not only over the length of caption sentences but also over the time needed to read a caption. Experiments using ActivityNet Captions and Spoken Moments in Time show that the proposed method effectively controls the length of the generated captions. Analysis of the embedding vectors with ICA shows that length and semantics are learned separately, demonstrating the effectiveness of the proposed embedding methods.
https://arxiv.org/abs/2408.15447
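A short sketch of the two multi-hot length representations and the nonlinear MLP that maps them to a length embedding; the bit width, maximum length, and MLP dimensions are illustrative choices, and the thermometer-style vector is used here as the common ordinal-regression encoding.

```python
# Illustrative bit and ordinal (thermometer) length representations plus a nonlinear MLP.
import torch
import torch.nn as nn

def bit_vector(length: int, num_bits: int = 8) -> torch.Tensor:
    """Bit representation, e.g. 5 -> [0,0,0,0,0,1,0,1]."""
    bits = [(length >> i) & 1 for i in reversed(range(num_bits))]
    return torch.tensor(bits, dtype=torch.float32)

def ordinal_vector(length: int, max_length: int = 50) -> torch.Tensor:
    """Ordinal (thermometer) representation, e.g. 5 -> [1,1,1,1,1,0,...,0]."""
    vec = torch.zeros(max_length)
    vec[:length] = 1.0
    return vec

class LengthEmbedding(nn.Module):
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, length_vec: torch.Tensor) -> torch.Tensor:
        return self.mlp(length_vec)

emb = LengthEmbedding(in_dim=8)
print(emb(bit_vector(12)).shape)   # torch.Size([512])
```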
Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the model size by nearly 80%, highlighting its efficiency in terms of model complexity. Our work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.
https://arxiv.org/abs/2408.14441
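A compact, hypothetical audio-visual fusion module in the spirit of the abstract, where audio and visual streams attend to each other before pooling and classification; the dimensions, head count, and class count are assumptions rather than the published Attend-Fusion architecture.

```python
# Hypothetical cross-attention audio-visual fusion followed by multi-label classification.
import torch
import torch.nn as nn

class AttendFusionSketch(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_classes=3862):  # class count assumed
        super().__init__()
        self.av_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.va_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio, visual):
        """audio: (B, Ta, dim), visual: (B, Tv, dim) frame-level features."""
        a2v, _ = self.av_attn(query=audio, key=visual, value=visual)  # audio attends to video
        v2a, _ = self.va_attn(query=visual, key=audio, value=audio)   # video attends to audio
        fused = torch.cat([a2v.mean(dim=1), v2a.mean(dim=1)], dim=-1)
        return self.classifier(fused)

model = AttendFusionSketch()
logits = model(torch.randn(2, 30, 512), torch.randn(2, 30, 512))
```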
Multi-modal large language models (MLLMs) have demonstrated considerable potential across various downstream tasks that require cross-domain knowledge. MLLMs capable of processing videos, known as Video-MLLMs, have attracted broad interest in video-language understanding. However, videos, especially long videos, contain more visual tokens than images, making them difficult for LLMs to process. Existing works either downsample visual features or extend the LLM context size, risking the loss of high-resolution information or slowing down inference speed. To address these limitations, we apply cross-attention layers in the intermediate projector between the visual encoder and the large language model (LLM). As the naive cross-attention mechanism is insensitive to temporal order, we further introduce causal cross-attention masks (CCAMs) within the cross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a straightforward two-stage fashion: feature alignment and visual instruction tuning. We develop several Video-CCAM models based on LLMs of different sizes (4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM and shows outstanding performance from short videos to long ones. Among standard video benchmarks like MVBench and VideoChatGPT-QA, Video-CCAM performs strongly (1st/2nd/3rd in MVBench and TGIF-QA, 2nd/3rd/4th in MSVD-QA, MSRVTT-QA, and ActivityNet-QA). In benchmarks encompassing long videos, Video-CCAM models can be directly adapted to long video understanding and still achieve exceptional scores despite being trained solely with images and 16-frame videos. Using 96 frames (6x the training number of frames), Video-CCAM models rank 1st/2nd/3rd in VideoVista and 1st/2nd/4th in MLVU among all open-source Video-MLLMs, respectively. The code is publicly available at this https URL.
https://arxiv.org/abs/2408.14023
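One plausible form of a causal cross-attention mask (CCAM) is sketched below: projector queries are tied to frames, and a query for frame i may only attend to visual tokens from frame i or earlier, making the cross-attention sensitive to temporal order; the query/token layout is an assumption, not Video-CCAM's exact design.

```python
# Illustrative causal cross-attention mask between frame-tied queries and visual tokens.
import torch

def causal_cross_attention_mask(num_query_frames: int, queries_per_frame: int,
                                num_kv_frames: int, tokens_per_frame: int) -> torch.Tensor:
    q_frame = torch.arange(num_query_frames * queries_per_frame) // queries_per_frame
    kv_frame = torch.arange(num_kv_frames * tokens_per_frame) // tokens_per_frame
    # allowed[q, k] is True when the key/value token comes from the query's frame or earlier.
    return q_frame.unsqueeze(1) >= kv_frame.unsqueeze(0)

mask = causal_cross_attention_mask(num_query_frames=4, queries_per_frame=2,
                                   num_kv_frames=4, tokens_per_frame=3)
print(mask.shape)   # torch.Size([8, 12])
```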
The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task due to the diverse video content and the complex spatial and temporal distortions, thus necessitating more advanced methods to address these issues. Nowadays, large multimodal models (LMMs), such as GPT-4V, have exhibited strong capabilities for various visual understanding tasks, motivating us to leverage the powerful multimodal representation ability of LMMs to solve the VQA task. Therefore, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we first reformulate the quality regression problem into a question and answering (Q&A) task and construct Q&A prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features to represent the quality characteristics of videos, which are subsequently mapped into the language space by the spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquired text tokens are aggregated as inputs for the large language model (LLM) to generate the quality score and level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of 5% in generalization ability over existing methods. Furthermore, due to the advanced design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness. Our code will be released at this https URL.
https://arxiv.org/abs/2408.14008
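An illustrative example of recasting quality regression as a question-answer pair for instruction tuning, as the abstract describes; the prompt wording and the score-to-level mapping are assumptions, not the paper's actual templates.

```python
# Hypothetical construction of a Q&A pair from a video id and its mean opinion score (MOS).
def build_vqa_qa_pair(video_id: str, mos: float) -> dict:
    levels = ["bad", "poor", "fair", "good", "excellent"]
    level = levels[min(int(mos), 4)]              # map a 0-5 MOS to a quality level (assumed)
    return {
        "video": video_id,
        "question": "How would you rate the visual quality of this video?",
        "answer": f"The quality of the video is {level} ({mos:.2f}).",
    }

print(build_vqa_qa_pair("example_clip_001", mos=3.42))
```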
Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models, including system, model training, and dataset development. On the system side, we introduce the first Multi-Modal Sequence Parallelism (MM-SP) system that enables long-context training and inference, enabling 2M context length training on 256 GPUs. MM-SP is also efficient, being 2.1x - 5.7x faster than Ring-Style Sequence Parallelism and 1.1x - 1.4x faster than Megatron-LM in text-only settings. Moreover, it seamlessly integrates with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, context extension, and long-short joint supervised fine-tuning. Regarding datasets, we meticulously construct large-scale visual language pre-training datasets and long video instruction-following datasets to support our multi-stage training process. The full-stack solution extends the feasible frame number of VILA by a factor of 128 (from 8 to 1024 frames) and improves the long video captioning score from 2.00 to 3.26 (1.6x), achieving 99.5% accuracy in a 1400-frame video (274k context length) needle-in-a-haystack evaluation. LongVILA-8B also demonstrates a consistent improvement in performance on long videos within the VideoMME benchmark as the video frames increase.
https://arxiv.org/abs/2408.10188
In recent years, video action recognition, as a fundamental task in the field of video understanding, has been deeply explored by numerous researchers. Most traditional video action recognition methods convert videos into three-dimensional data that encapsulates both spatial and temporal information and then leverage prevalent image understanding models to model and analyze these data. However, these methods have significant drawbacks. First, when applied to video action recognition, image understanding models often need to be adapted in terms of model architecture and preprocessing for these spatiotemporal tasks; second, dealing with high-dimensional data often poses greater challenges and incurs higher time costs than its lower-dimensional counterpart. To bridge the gap between image-understanding and video-understanding tasks while simplifying the complexity of video comprehension, we introduce a novel video representation architecture, Flatten, which serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network for efficient and effective 3D temporal data modeling. Specifically, by applying specific flattening operations (e.g., a row-major transform), 3D spatiotemporal data is transformed into 2D spatial information, and ordinary image understanding models are then used to capture temporal dynamics and spatial semantic information, which in turn accomplishes effective and efficient video action recognition. Extensive experiments on commonly used datasets (Kinetics-400, Something-Something v2, and HMDB-51) and three classical image classification models (Uniformer, SwinV2, and ResNet) demonstrate that embedding Flatten provides significant performance improvements over the original models.
https://arxiv.org/abs/2408.09220
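A hedged sketch of one simple row-major flattening: a clip of T frames is laid out as a 2D grid image so a standard image backbone can consume it; the exact transform used by Flatten may differ, and this only illustrates the 3D-to-2D idea from the abstract.

```python
# Illustrative row-major 3D-to-2D flattening of a video clip into a frame grid.
import numpy as np

def flatten_clip(clip: np.ndarray, grid_cols: int) -> np.ndarray:
    """clip: (T, H, W, C) video tensor -> (rows*H, grid_cols*W, C) 2D image grid."""
    T, H, W, C = clip.shape
    assert T % grid_cols == 0, "pad or sample frames so T divides evenly"
    rows = T // grid_cols
    # Row-major: frame t goes to grid cell (t // grid_cols, t % grid_cols).
    grid = clip.reshape(rows, grid_cols, H, W, C)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, H, cols, W, C)
    return grid.reshape(rows * H, grid_cols * W, C)

clip = np.random.rand(16, 56, 56, 3)
print(flatten_clip(clip, grid_cols=4).shape)      # (224, 224, 3)
```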