The ability to perceive fine-grained spatial and temporal information is crucial for video-language retrieval. However, existing video retrieval benchmarks, such as MSRVTT and MSVD, fail to effectively evaluate the fine-grained retrieval ability of video-language models (VLMs) due to a lack of detailed annotations. To address this problem, we present FIBER, a FIne-grained BEnchmark for text-to-video Retrieval, containing 1,000 videos sourced from the FineAction dataset. Uniquely, our FIBER benchmark provides detailed human-annotated spatial and temporal annotations for each video, making it possible to independently evaluate the spatial and temporal bias of VLMs on the video retrieval task. In addition, we employ a text embedding method to unlock the fine-grained video-language understanding capability of Multimodal Large Language Models (MLLMs). Surprisingly, the experimental results show that our Video Large Language Encoder (VLLE) performs comparably to CLIP-based models on traditional benchmarks and has a stronger capability for fine-grained representation with lower spatial-temporal bias. Project page: this https URL.
https://arxiv.org/abs/2501.00513
Multimodal representation learning with contrastive learning plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video and text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in calculations within the Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving the adaptive encoding characteristics of the cross-modal representation. Additionally, we extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video question answering, and video captioning benchmarks, showing superior performance, validate the effectiveness and generalization of our method.
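For readers unfamiliar with the game-theoretic machinery, the Banzhaf interaction index between two players has a simple closed form; the toy sketch below computes it for a hand-made coalition value function in which a video clip and its matching word show synergy. The players, value function, and numbers are purely illustrative; the paper's hierarchical, learned variant is far more involved.

```python
# Toy computation of the Banzhaf interaction index between two "players".
from itertools import chain, combinations

def powerset(items):
    """All subsets of `items`, as tuples."""
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def banzhaf_interaction(v, players, i, j):
    """Banzhaf interaction index between players i and j under value function v."""
    rest = [p for p in players if p not in (i, j)]
    total = 0.0
    for subset in powerset(rest):
        s = set(subset)
        total += v(s | {i, j}) - v(s | {i}) - v(s | {j}) + v(s)
    return total / (2 ** len(rest))

# Hypothetical value function: a coalition is worth more when the matching
# clip/word pair is present together (numbers are made up for illustration).
def value(coalition):
    bonus = 0.6 if {"clip_run", "word_run"} <= coalition else 0.0
    return 0.2 * len(coalition) + bonus

players = ["clip_run", "word_run", "word_jump"]
print(banzhaf_interaction(value, players, "clip_run", "word_run"))   # 0.6: strong synergy
print(banzhaf_interaction(value, players, "clip_run", "word_jump"))  # 0.0: no interaction
```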
https://arxiv.org/abs/2412.20964
Video Corpus Visual Answer Localization (VCVAL) includes question-related video retrieval and visual answer localization in the videos. Specifically, we use text-to-text retrieval to find relevant videos for a medical question, based on the similarity between video transcripts and answers generated by GPT-4. For visual answer localization, the start and end timestamps of the answer are predicted by aligning both the visual content and the subtitles with the query. For the Query-Focused Instructional Step Captioning (QFISC) task, the step captions are generated by GPT-4. Specifically, we provide the video captions generated by the LLaVA-Next-Video model and the video subtitles with timestamps as context, and ask GPT-4 to generate step captions for the given medical query. We submit only one run for evaluation, and it obtains an F-score of 11.92 and a mean IoU of 9.6527.
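As a rough illustration of the retrieval step described above, the sketch below ranks candidate videos by the similarity between their transcripts and a GPT-4-generated answer. TF-IDF cosine similarity is used here as a stand-in for whatever text encoder the submission actually uses, and the transcripts and answer text are invented.

```python
# Minimal text-to-text retrieval sketch: rank videos by transcript-answer similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

transcripts = {
    "vid_001": "apply the splint to the forearm and wrap the bandage firmly",
    "vid_002": "chop the onions finely and saute them until golden",
    "vid_003": "check the patient's airway, breathing and circulation first",
}
gpt4_answer = "To immobilize a fractured forearm, place a splint and wrap it with a bandage."

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(list(transcripts.values()) + [gpt4_answer])
sims = cosine_similarity(doc_matrix[-1:], doc_matrix[:-1]).ravel()

ranked = sorted(zip(transcripts.keys(), sims), key=lambda kv: -kv[1])
print(ranked)   # vid_001 should come out on top for this toy query
```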
https://arxiv.org/abs/2412.15514
Video has emerged as a favored multimedia format on the internet. To better access video content, a new topic, HIREST, is presented, comprising video retrieval, moment retrieval, moment segmentation, and step-captioning. The pioneering work chooses a pre-trained CLIP-based model for video retrieval and leverages it as a feature extractor for the other three challenging tasks, which are solved in a multi-task learning paradigm. Nevertheless, this work struggles to learn a comprehensive cognition of user-preferred content because it disregards the hierarchies and association relations across modalities. In this paper, guided by the shallow-to-deep principle, we propose a query-centric audio-visual cognition (QUAG) network to construct a reliable multi-modal representation for moment retrieval, segmentation, and step-captioning. Specifically, we first design modality-synergistic perception to obtain rich audio-visual content by modeling global contrastive alignment and local fine-grained interaction between the visual and audio modalities. Then, we devise query-centric cognition, which uses the deep-level query to perform temporal-channel filtration on the shallow-level audio-visual representation. This cognizes user-preferred content and thus attains a query-centric audio-visual representation for the three tasks. Extensive experiments show that QUAG achieves state-of-the-art results on HIREST. Further, we test QUAG on the query-based video summarization task and verify its good generalization.
https://arxiv.org/abs/2412.13543
Current video retrieval systems, especially those used in competitions, primarily focus on querying individual keyframes or images rather than encoding an entire clip or video segment. However, queries often describe an action or event over a series of frames, not a specific image. This results in insufficient information when analyzing a single frame, leading to less accurate query results. Moreover, extracting embeddings solely from images (keyframes) does not provide enough information for models to encode the higher-level, more abstract insights inferred from the video. These models tend to describe only the objects present in the frame, lacking a deeper understanding. In this work, we propose a system that integrates the latest methodologies, introducing a novel pipeline that extracts multimodal data and incorporates information from multiple frames within a video, enabling the model to abstract higher-level information that captures latent meanings, focusing on what can be inferred from the video clip rather than on object detection in a single image.
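A minimal sketch of the core argument, assuming frame embeddings come from an image encoder such as CLIP: pooling embeddings across sampled frames yields a clip-level representation that a single keyframe embedding cannot provide. The random vectors below merely stand in for real encoder outputs.

```python
# Compare single-keyframe scoring against simple multi-frame mean-pooling.
import numpy as np

rng = np.random.default_rng(0)
frame_embs = rng.normal(size=(16, 512))            # 16 sampled frames, 512-d each
frame_embs /= np.linalg.norm(frame_embs, axis=1, keepdims=True)

keyframe_emb = frame_embs[0]                       # single-keyframe baseline
clip_emb = frame_embs.mean(axis=0)                 # temporal mean-pooling over frames
clip_emb /= np.linalg.norm(clip_emb)

query_emb = rng.normal(size=512)
query_emb /= np.linalg.norm(query_emb)

print("keyframe score:", float(keyframe_emb @ query_emb))
print("clip score:    ", float(clip_emb @ query_emb))
```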
https://arxiv.org/abs/2412.07584
Content creators often use music to enhance their videos, from soundtracks in movies to background music in video blogs and social media content. However, identifying the best music for a video can be a difficult and time-consuming task. To address this challenge, we propose a novel framework for automatically retrieving a matching music clip for a given video, and vice versa. Our approach leverages annotated music labels, as well as the inherent artistic correspondence between visual and music elements. Distinct from previous cross-modal music retrieval works, our method combines both self-supervised and supervised training objectives. We use self-supervised and label-supervised contrastive learning to train a joint embedding space between music and video. We show the effectiveness of our approach by using music genre labels for the supervised training component, and our framework can be generalized to other music annotations (e.g., emotion, instrument, etc.). Furthermore, our method enables fine-grained control over how much the retrieval process focuses on self-supervised vs. label information at inference time. We evaluate the learned embeddings through a variety of video-to-music and music-to-video retrieval tasks. Our experiments show that the proposed approach successfully combines self-supervised and supervised objectives and is effective for controllable music-video retrieval.
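The controllable inference-time behavior can be pictured as a single interpolation knob between the two embedding spaces; the sketch below is one plausible reading of that idea, with made-up embeddings and a hypothetical `alpha` fusion rule rather than the authors' exact formulation.

```python
# Controllable fusion of self-supervised and label-supervised similarities.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieval_score(video_ss, music_ss, video_lab, music_lab, alpha=0.5):
    """alpha=1.0 -> purely self-supervised space; alpha=0.0 -> purely label-supervised space."""
    return alpha * cosine(video_ss, music_ss) + (1.0 - alpha) * cosine(video_lab, music_lab)

rng = np.random.default_rng(1)
video_ss, video_lab = rng.normal(size=128), rng.normal(size=32)               # query video
candidates = [(rng.normal(size=128), rng.normal(size=32)) for _ in range(5)]  # music clips

for alpha in (0.0, 0.5, 1.0):
    scores = [retrieval_score(video_ss, m_ss, video_lab, m_lab, alpha) for m_ss, m_lab in candidates]
    print(f"alpha={alpha}: best clip = {int(np.argmax(scores))}")
```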
https://arxiv.org/abs/2412.05831
Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to limited context. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs would require extensive high-quality data and substantial GPU resources, while GPT-based agents would rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
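A minimal sketch of the plug-and-play pipeline, with placeholder extraction and encoding functions: auxiliary texts obtained from ASR, OCR, and object detection are retrieved in a single turn against the query and prepended to the LVLM prompt alongside the sampled frames. The helper names (`toy_encode`, the example `aux` strings) are hypothetical.

```python
# Single-turn retrieval over auxiliary texts, then prompt assembly for the LVLM.
import numpy as np
from typing import Callable, List

def video_rag_prompt(query: str,
                     aux_texts: List[str],
                     encode: Callable[[str], np.ndarray],
                     top_k: int = 5) -> str:
    q = encode(query)
    scores = []
    for text in aux_texts:
        v = encode(text)
        scores.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8)))
    order = np.argsort(scores)[::-1][:top_k]
    context = "\n".join(f"- {aux_texts[i]}" for i in order)
    return f"Auxiliary context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Toy bag-of-characters encoder; in practice aux_texts would come from open-source
# ASR/OCR/detection tools run on the video, and a real text encoder would be used.
def toy_encode(text: str) -> np.ndarray:
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec

aux = ["ASR: the coach says keep your elbows tucked in",
       "OCR: scoreboard reads 2-1 in the second half",
       "DET: person, soccer ball, goalpost"]
print(video_rag_prompt("What is the score in the match?", aux, toy_encode, top_k=2))
```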
https://arxiv.org/abs/2411.13093
Contextual advertising serves ads that are aligned to the content that the user is viewing. The rapid growth of video content on social platforms and streaming services, along with privacy concerns, has increased the need for contextual advertising. Placing the right ad in the right context creates a seamless and pleasant ad viewing experience, resulting in higher audience engagement and, ultimately, better ad monetization. From a technology standpoint, effective contextual advertising requires a video retrieval system capable of understanding complex video content at a very granular level. Current text-to-video retrieval models based on joint multimodal training demand large datasets and computational resources, limiting their practicality and lacking the key functionalities required for ad ecosystem integration. We introduce ContextIQ, a multimodal expert-based video retrieval system designed specifically for contextual advertising. ContextIQ utilizes modality-specific experts (video, audio, transcript (captions), and metadata such as objects, actions, emotion, etc.) to create semantically rich video representations. We show that our system, without joint training, achieves better or comparable results to state-of-the-art models and commercial solutions on multiple text-to-video retrieval benchmarks. Our ablation studies highlight the benefits of leveraging multiple modalities for enhanced video retrieval accuracy instead of using a vision-language model alone. Furthermore, we show how video retrieval systems such as ContextIQ can be used for contextual advertising in an ad ecosystem while also addressing concerns related to brand safety and filtering inappropriate content.
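One plausible reading of expert-based retrieval without joint training is late fusion of per-expert similarity scores; the sketch below illustrates that idea with invented scores and weights, not ContextIQ's actual fusion logic.

```python
# Weighted late fusion over modality-specific expert scores.
def fuse_expert_scores(expert_scores: dict, weights: dict) -> float:
    """Weighted average over whichever experts produced a score for this video."""
    total, norm = 0.0, 0.0
    for name, score in expert_scores.items():
        w = weights.get(name, 0.0)
        total += w * score
        norm += w
    return total / norm if norm > 0 else 0.0

# Hypothetical per-expert query-video similarities and fusion weights.
scores = {"video": 0.62, "audio": 0.40, "transcript": 0.75, "metadata": 0.55}
weights = {"video": 1.0, "audio": 0.5, "transcript": 1.0, "metadata": 0.8}
print(round(fuse_expert_scores(scores, weights), 3))
```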
https://arxiv.org/abs/2410.22233
We introduce a goal-oriented conversational AI system enhanced with American Sign Language (ASL) instructions, presenting the first implementation of such a system on a worldwide multimodal conversational AI platform. Accessible through a touch-based interface, our system receives input from users and seamlessly generates ASL instructions by leveraging retrieval methods and cognitively based gloss translations. Central to our design is a sign translation module powered by Large Language Models, alongside a token-based video retrieval system for delivering instructional content from recipes and wikiHow guides. Our development process is deeply rooted in a commitment to community engagement, incorporating insights from the Deaf and Hard-of-Hearing community, as well as experts in cognitive and ASL learning sciences. The effectiveness of our signing instructions is validated by user feedback, achieving ratings on par with those of the system in its non-signing variant. Additionally, our system demonstrates exceptional performance in retrieval accuracy and text-generation quality, measured by metrics such as BERTScore. We have made our codebase and datasets publicly accessible at this https URL, and a demo of our signed instruction video retrieval system is available at this https URL.
https://arxiv.org/abs/2410.14026
Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce MultiVENT 2.0, a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to adequately address this problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation tasks.
https://arxiv.org/abs/2410.11619
Text-video retrieval (TVR) has seen substantial advancements in recent years, fueled by the utilization of pre-trained models and large language models (LLMs). Despite these advancements, achieving accurate matching in TVR remains challenging due to inherent disparities between the video and textual modalities and irregularities in data representation. In this paper, we propose Text-Video-ProxyNet (TV-ProxyNet), a novel framework designed to decompose the conventional 1-to-N relationship of TVR into N distinct 1-to-1 relationships. By replacing a single text query with a series of text proxies, TV-ProxyNet not only broadens the query scope but also achieves a more precise expansion. Each text proxy is crafted through a refined iterative process, controlled by mechanisms we term the director and the dash, which regulate the proxy's direction and distance relative to the original text query. This setup not only facilitates more precise semantic alignment but also effectively manages the disparities and noise inherent in multimodal data. Our experiments on three representative video-text retrieval benchmarks, MSRVTT, DiDeMo, and ActivityNet Captions, demonstrate the effectiveness of TV-ProxyNet. The results show an improvement of 2.0% to 3.3% in R@1 over the baseline. TV-ProxyNet achieved state-of-the-art performance on MSRVTT and ActivityNet Captions, and a 2.0% improvement on DiDeMo compared to existing methods, validating our approach's ability to enhance semantic mapping and reduce error propensity.
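The abstract only names the director and dash mechanisms, so the sketch below is a speculative instantiation: each proxy is the query embedding moved along a learned direction by a learned step size. Module names and shapes are assumptions, not the paper's architecture.

```python
# One plausible reading of text proxies: query + dash_k * director_k(query).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextProxyGenerator(nn.Module):
    def __init__(self, dim: int, num_proxies: int):
        super().__init__()
        self.directors = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_proxies)])
        self.dash = nn.Parameter(torch.zeros(num_proxies))   # learned per-proxy step sizes

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        """query: (B, D) -> proxies: (B, N, D); each proxy gets its own 1-to-1 match."""
        proxies = []
        for k, director in enumerate(self.directors):
            direction = F.normalize(director(query), dim=-1)
            proxies.append(query + self.dash[k] * direction)
        return torch.stack(proxies, dim=1)

gen = TextProxyGenerator(dim=512, num_proxies=4)
print(gen(torch.randn(2, 512)).shape)   # torch.Size([2, 4, 512])
```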
https://arxiv.org/abs/2410.06618
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, given that videos often contain abundant detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. Firstly, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset of VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding the long-description capability. We also introduce two new tasks, namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), for further understanding improvement. Finally, we construct a Long Video Description Ranking (LVDR) benchmark to evaluate the long-description capability more comprehensively. Extensive experimental results on widely used text-video retrieval benchmarks with both short and long descriptions, as well as on our LVDR benchmark, fully demonstrate the effectiveness of our method.
https://arxiv.org/abs/2410.00741
Text-Video Retrieval (TVR) methods typically match query-candidate pairs by aligning text and video features in coarse-grained, fine-grained, or combined (coarse-to-fine) manners. However, these frameworks predominantly employ a one(query)-to-one(candidate) alignment paradigm, which struggles to discern nuanced differences among candidates, leading to frequent mismatches. Inspired by Comparative Judgement in human cognitive science, where decisions are made by directly comparing items rather than evaluating them independently, we propose TokenBinder. This innovative two-stage TVR framework introduces a novel one-to-many coarse-to-fine alignment paradigm, imitating the human cognitive process of identifying specific items within a large collection. Our method employs a Focused-view Fusion Network with a sophisticated cross-attention mechanism, dynamically aligning and comparing features across multiple videos to capture finer nuances and contextual variations. Extensive experiments on six benchmark datasets confirm that TokenBinder substantially outperforms existing state-of-the-art methods. These results demonstrate its robustness and the effectiveness of its fine-grained alignment in bridging intra- and inter-modality information gaps in TVR tasks.
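A minimal sketch of the one-to-many, coarse-to-fine idea: a cheap one-to-one pass shortlists top-k candidates, and a second stage scores the query against the whole shortlist jointly so candidates can be compared with each other. The `toy_joint_scorer` is a stand-in for the paper's Focused-view Fusion Network with cross-attention.

```python
# Two-stage retrieval: coarse one-to-one shortlist, then joint one-to-many re-ranking.
import torch
import torch.nn.functional as F

def coarse_to_fine_retrieve(query_emb, video_embs, joint_scorer, k=8):
    """query_emb: (D,), video_embs: (N, D); returns the index of the best video."""
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    coarse = v @ q                                   # (N,) one-to-one similarities
    shortlist = torch.topk(coarse, k=min(k, v.shape[0])).indices
    fine = joint_scorer(q, v[shortlist])             # (k,) scores computed jointly
    return int(shortlist[torch.argmax(fine)])

# Stand-in joint scorer: lets shortlisted candidates compete via a softmax.
def toy_joint_scorer(q, candidates):
    logits = candidates @ q
    return torch.softmax(logits, dim=0) * logits

print(coarse_to_fine_retrieve(torch.randn(512), torch.randn(100, 512), toy_joint_scorer))
```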
https://arxiv.org/abs/2409.19865
Recently, video-language understanding has achieved great success through large-scale pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets. Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations. These methods successfully leverage useful information in multimodal video content (frames, tags, ASR transcripts, etc.) to refine the original annotations. Nevertheless, they struggle to mitigate noise within synthetic annotations and lack scalability as the dataset size expands. To address these issues, we introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods. For iterative refinement, we first leverage a video-language model to generate synthetic annotations, resulting in a refined dataset. Then, we pre-train on it and fine-tune on human refinement examples for a stronger model. These processes are repeated for continuous improvement. For noise control, we present AdaTaiLr, a novel noise control method that requires weaker assumptions on noise distribution, thereby proving more effective in large datasets with theoretical guarantees. The combination of iterative refinement and AdaTaiLr can achieve better scalability in video-language understanding. Extensive experiments show that our framework outperforms existing data refinement baselines, delivering a 3% performance boost and improving dataset quality with minimal diversity loss. Furthermore, our refined dataset facilitates significant improvements in various video-language understanding tasks, including video question answering and text-video retrieval.
https://arxiv.org/abs/2409.19532
Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR) for pretraining on UCF101 or Kinetics and test on standard benchmarks including video retrieval, action recognition, and action detection. The performances are enhanced by a significant margin without the need for large models or extensive datasets.
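The core construction is easy to state concretely: take finite differences of the frame sequence as extra views and ask one encoder to embed all views consistently. The sketch below does exactly that with a toy encoder and a simple cosine-consistency loss standing in for the VICReg/BYOL/SimCLR objectives ViDiDi actually plugs into.

```python
# Temporal derivatives of a frame sequence plus a toy consistency loss.
import torch
import torch.nn.functional as F

def temporal_derivatives(frames: torch.Tensor, max_order: int = 2):
    """frames: (T, C, H, W) -> list of tensors for derivative orders 0..max_order."""
    views, current = [frames], frames
    for _ in range(max_order):
        current = current[1:] - current[:-1]     # finite difference along time
        views.append(current)
    return views

def consistency_loss(encoder, views):
    embs = [F.normalize(encoder(v).mean(dim=0), dim=-1) for v in views]
    anchor = embs[0]                              # embedding of the original frames
    return sum(1.0 - (anchor @ e) for e in embs[1:]) / (len(embs) - 1)

# Toy usage with a per-frame linear encoder over flattened frames.
frames = torch.randn(8, 3, 16, 16)
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 16 * 16, 64))
print(float(consistency_loss(encoder, temporal_derivatives(frames))))
```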
https://arxiv.org/abs/2409.02371
Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets. Project Page: this https URL.
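A minimal illustration of the cycle-consistency style of evaluation mentioned above: map each frame of video A to its nearest neighbor in video B and back, and measure how often it returns near its starting position. This is an illustrative metric, not DRAQ itself.

```python
# Cycle-consistency check between per-frame features of two videos.
import numpy as np

def cycle_consistency(feats_a: np.ndarray, feats_b: np.ndarray, tolerance: int = 1) -> float:
    """feats_*: (T, D) per-frame features; returns the fraction of consistent frames."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    ab = np.argmax(a @ b.T, axis=1)           # A -> B nearest neighbors
    aba = np.argmax(b[ab] @ a.T, axis=1)      # back to A
    return float(np.mean(np.abs(aba - np.arange(len(a))) <= tolerance))

rng = np.random.default_rng(0)
base = rng.normal(size=(20, 64))
print(cycle_consistency(base, base + 0.05 * rng.normal(size=base.shape)))  # near 1.0
print(cycle_consistency(base, rng.normal(size=(20, 64))))                  # much lower
```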
https://arxiv.org/abs/2409.01445
Most text-video retrieval methods utilize the text-image pre-trained CLIP as a backbone and incorporate complex modules that result in high computational overhead. As a result, many studies focus on efficient fine-tuning. The primary challenge in efficient adaptation arises from the inherent differences between the image and video modalities: each sampled video frame must be processed by the image encoder independently, which increases complexity and complicates practical deployment. Although existing efficient methods fine-tune with a small number of trainable parameters, they still incur high inference costs due to the large number of tokens. In this work, we argue that temporal redundancy significantly contributes to the model's high complexity because of the repeated information in consecutive frames. Existing token compression methods for image models fail to solve these unique challenges, as they overlook temporal redundancy across frames. To tackle these problems, we propose Temporal Token Merging (TempMe) to reduce temporal redundancy. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we merge temporal tokens across different frames and learn video-level features, leading to lower complexity and better performance. Extensive experiments validate the superiority of our TempMe. Compared to previous efficient text-video retrieval methods, TempMe significantly reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8X speedup and a 4.4% R-Sum improvement. Additionally, TempMe exhibits robust generalization capabilities by integrating effectively with both efficient and full fine-tuning methods. With full fine-tuning, TempMe achieves a significant 7.9% R-Sum improvement, trains 1.57X faster, and uses 75.2% of the GPU memory. Our code will be released.
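A minimal sketch of the redundancy argument: tokens from consecutive frames that are nearly identical can be averaged into one, shrinking the token count downstream. TempMe's progressive multi-granularity merging is more elaborate; the threshold and pairing rule below are assumptions for illustration.

```python
# Merge near-duplicate tokens across two consecutive frames by cosine similarity.
import torch
import torch.nn.functional as F

def merge_redundant_tokens(tokens_t: torch.Tensor, tokens_t1: torch.Tensor, threshold: float = 0.95):
    """tokens_t, tokens_t1: (N, D) tokens of two consecutive frames."""
    sim = F.normalize(tokens_t, dim=-1) @ F.normalize(tokens_t1, dim=-1).T   # (N, N)
    best_sim, best_idx = sim.max(dim=1)
    merged, used = [], set()
    for i in range(tokens_t.shape[0]):
        j = int(best_idx[i])
        if best_sim[i] >= threshold and j not in used:
            merged.append(0.5 * (tokens_t[i] + tokens_t1[j]))   # merge the redundant pair
            used.add(j)
        else:
            merged.append(tokens_t[i])                          # keep as-is
    leftovers = [tokens_t1[j] for j in range(tokens_t1.shape[0]) if j not in used]
    return torch.stack(merged + leftovers)

a, b = torch.randn(16, 64), torch.randn(16, 64)
b[:8] = a[:8] + 0.01 * torch.randn(8, 64)        # simulate temporal redundancy
print(merge_redundant_tokens(a, b).shape)         # fewer than 32 tokens survive
```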
https://arxiv.org/abs/2409.01156
Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate similarity scores, which are then sorted to obtain the retrieval results. This approach considers the matching between each candidate video and the query, but it incurs a significant time cost that grows notably as the number of candidates increases. Generative models are common in natural language processing and computer vision and have been successfully applied to document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, a sequence-to-sequence generative model that directly generates video identifiers and retrieves candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines, using only 30%-50% of the original retrieval time, to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at this https URL.
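A minimal sketch of generative retrieval with identifier-constrained decoding: video identifiers are short token sequences stored in a trie, and decoding is restricted so only prefixes of real identifiers can be emitted. The greedy scorer below is a dummy standing in for the seq2seq generator conditioned on the query, and the identifiers are invented.

```python
# Trie-constrained greedy decoding of a video identifier.
from typing import Dict, List

def build_trie(identifiers: List[List[str]]) -> Dict:
    root: Dict = {}
    for ident in identifiers:
        node = root
        for tok in ident:
            node = node.setdefault(tok, {})
        node["<end>"] = {}
    return root

def constrained_decode(score_next, trie: Dict, max_len: int = 8) -> List[str]:
    """Greedy decode, only ever choosing tokens that stay inside the trie."""
    node, output = trie, []
    for _ in range(max_len):
        allowed = [t for t in node if t != "<end>"]
        if not allowed:
            break
        tok = max(allowed, key=lambda t: score_next(output, t))
        output.append(tok)
        node = node[tok]
    return output

ids = [["sports", "run", "07"], ["sports", "swim", "02"], ["cook", "pasta", "11"]]
trie = build_trie(ids)
# Dummy scorer that prefers identifier tokens appearing in the (hypothetical) query.
query = "a sports video of a person running on a track"
score = lambda prefix, tok: 1.0 if tok in query else 0.0
print(constrained_decode(score, trie))   # ['sports', 'run', '07'] for this toy query
```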
https://arxiv.org/abs/2408.11432
Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherently plain structure of CLIP, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid to the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks validate the superiority of MUSE.
https://arxiv.org/abs/2408.10575
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions, and their relations. Composition understanding becomes particularly challenging for video data since compositional relations rapidly change over time in videos. We first build a benchmark named AARO to evaluate composition understanding related to actions on top of spatial concepts. The benchmark is constructed by generating negative texts with incorrect action descriptions for a given video, and the model is expected to pair a positive text with its corresponding video. Furthermore, we propose a training method called NAVERO, which utilizes video-text data augmented with negative texts to enhance composition understanding. We also develop a negative-augmented visual-language matching loss that is used explicitly to benefit from the generated negative texts. We compare NAVERO with other state-of-the-art methods in terms of compositional understanding as well as video-text retrieval performance. NAVERO achieves significant improvements over other methods in both video-language and image-language composition understanding, while maintaining strong performance on traditional text-video retrieval tasks.
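One plausible form of a negative-augmented visual-language matching loss is a hinge over generated negatives, sketched below; the margin formulation and the way negatives are produced (action-word swaps) are assumptions based on the abstract, not NAVERO's exact loss.

```python
# Hinge-style matching loss pushing the positive text above action-corrupted negatives.
import torch
import torch.nn.functional as F

def negative_augmented_loss(video_emb, pos_text_emb, neg_text_embs, margin: float = 0.2):
    """video_emb: (D,), pos_text_emb: (D,), neg_text_embs: (K, D)."""
    v = F.normalize(video_emb, dim=-1)
    pos = F.normalize(pos_text_emb, dim=-1) @ v
    negs = F.normalize(neg_text_embs, dim=-1) @ v                 # (K,)
    return torch.clamp(margin - pos + negs, min=0.0).mean()       # hinge over negatives

# Negative texts would be generated by swapping the action, e.g.
#   positive: "a person opens the door"   negative: "a person closes the door"
v, p, n = torch.randn(256), torch.randn(256), torch.randn(4, 256)
print(float(negative_augmented_loss(v, p, n)))
```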
https://arxiv.org/abs/2408.09511