A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks. However, recent works have shown that current models do not achieve a comprehensive understanding of the textual data while training for the target downstream tasks. Orthogonal to previous approaches to this limitation, we postulate that understanding the significance of sentence components with respect to the target task can enhance model performance. Hence, we utilize the knowledge of a pre-trained large language model (LLM) to generate text samples from the original ones, targeting specific sentence components. We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize it to improve different video-language tasks. Through rigorous quantitative analysis, our proposed method exhibits significant improvements across several video-language tasks. In particular, our approach notably enhances video-text retrieval, with relative improvements of 8.3\% in video-to-text and 1.4\% in text-to-video retrieval over the baselines in terms of R@1. Additionally, in video moment retrieval, average mAP shows relative improvements ranging from 2.0\% to 13.7\% across different baselines.
https://arxiv.org/abs/2312.06699
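To make the component-importance idea above concrete, here is a minimal sketch (not the authors' implementation): ablated text variants, which an LLM would generate in the actual method, are scored by the drop in text-video similarity they cause, and the drops are normalized into importance weights. The toy hash-based encoder, the hand-written variants, and the softmax weighting are all illustrative assumptions.

```python
import numpy as np

def embed_text(s: str, dim: int = 64) -> np.ndarray:
    # Toy deterministic stand-in for a pretrained text encoder (assumption);
    # the actual method would use real text/video encoders.
    rng = np.random.default_rng(abs(hash(s)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def importance_from_ablations(video_emb, original, ablated_variants):
    """Estimate relative importance of sentence components from how much
    removing each component lowers the text-video similarity."""
    base = float(video_emb @ embed_text(original))
    drops = np.array([base - float(video_emb @ embed_text(v))
                      for v in ablated_variants])
    w = np.exp(drops - drops.max())       # softmax over similarity drops
    return w / w.sum()

# Variants an LLM might produce by removing one component at a time (illustrative).
original = "a man plays guitar on a stage"
variants = ["plays guitar on a stage",    # subject removed
            "a man on a stage",           # verb phrase removed
            "a man plays guitar"]         # location removed
video_emb = embed_text("video of a man playing guitar")  # placeholder video embedding
print(importance_from_ablations(video_emb, original, variants))
```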
In this paper, we propose an efficient and high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one moment relevant to the input text query. In terms of both efficiency and performance, the overlooked bottleneck of previous studies is the visual encoding of dense frames. This guides researchers to choose lightweight visual backbones, yielding sub-optimal retrieval performance due to the limited capability of their learned visual representations. However, it is undesirable to simply replace them with high-performance large-scale vision-and-language models (VLMs) due to their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ of the original and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that with a simple query-image attention trick, VLMs generalize well to super images and demonstrate promising zero-shot performance against SOTA methods with high efficiency. In addition, we propose a fine-tuning approach that incorporates a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.
https://arxiv.org/abs/2312.00414
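A minimal sketch of the super-image construction described above: $N^2$ uniformly sampled frames are tiled into one $N \times N$ grid so that a single VLM forward pass replaces $N^2$ frame encodings. The grid size, padding with black frames, and truncation behavior are assumptions, not details taken from the paper.

```python
import numpy as np

def make_super_image(frames: np.ndarray, n: int) -> np.ndarray:
    """frames: (T, H, W, C) sampled frames; returns one (n*H, n*W, C) super image.
    Pads with black frames if T < n*n, truncates if T > n*n (assumed behavior)."""
    t, h, w, c = frames.shape
    need = n * n
    if t < need:
        pad = np.zeros((need - t, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, pad], axis=0)
    frames = frames[:need]
    rows = [np.concatenate(list(frames[r * n:(r + 1) * n]), axis=1) for r in range(n)]
    return np.concatenate(rows, axis=0)

video = np.random.randint(0, 256, size=(10, 224, 224, 3), dtype=np.uint8)
super_img = make_super_image(video, n=3)   # one 672x672 image instead of 9 encodings
print(super_img.shape)  # (672, 672, 3)
```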
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, in a single-phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions using several long video datasets, showing that they struggle with the transformed data, especially with the shortest captions. We also propose a lightweight fine-tuning method, where we use a contrastive loss to learn a hierarchical embedding based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as on the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at this https URL.
https://arxiv.org/abs/2312.00115
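The hierarchical embedding objective is described only at a high level above; the following is a simplified multi-positive contrastive loss in which each video is pulled toward several captions of differing granularity (short summary through full paragraph). The exact hierarchical structure, temperature, and batch construction in the paper may differ; everything here is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def multi_granularity_info_nce(video_emb, caption_embs, temperature=0.07):
    """video_emb: (B, D); caption_embs: (B, K, D) with K captions per video at
    different levels of detail. Each caption is a positive for its own video
    and a negative for all other videos (simplified stand-in objective)."""
    b, k, d = caption_embs.shape
    v = F.normalize(video_emb, dim=-1)                       # (B, D)
    c = F.normalize(caption_embs.reshape(b * k, d), dim=-1)  # (B*K, D)
    logits = v @ c.t() / temperature                         # (B, B*K)
    pos_mask = torch.zeros_like(logits, dtype=torch.bool)
    for i in range(b):                                       # positives for video i
        pos_mask[i, i * k:(i + 1) * k] = True
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob[pos_mask].reshape(b, k)).mean()

loss = multi_granularity_info_nce(torch.randn(4, 256), torch.randn(4, 3, 256))
print(loss.item())
```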
Learning from videos is an emerging research area that enables robots to acquire skills from human demonstrations, such as procedural videos. To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize these understandings to novel domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) intra-video retrieval over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to make use of: (1) out-of-domain visual information; (2) a high temporal context window; and (3) multimodal (text + video) domains. This departs from existing benchmarks for procedural video understanding, which typically deal with short context lengths and can be solved with a single modality. Spacewalk-18, with its inherent multimodal and long-form complexity, exposes the high difficulty of task recognition and segmentation. We find that state-of-the-art methods perform poorly on our benchmark, demonstrating that the goal of generalizable procedural video understanding models remains far off and underscoring the need to develop new approaches to these tasks. Data, model, and code will be publicly released.
https://arxiv.org/abs/2311.18773
To build scalable models for challenging real-world tasks, it is important to learn from diverse, multi-modal data in various forms (e.g., videos, text, and images). Among existing works, a plethora have focused on leveraging large but cumbersome cross-modal architectures. Despite their effectiveness, larger architectures unavoidably prevent the models from being extended to real-world applications, so building a lightweight VL architecture and an efficient learning schema is of great practical value. In this paper, we propose an Efficient Video-Language Model (dubbed E-ViLM) and a masked video modeling (MVM) schema, assisted by a semantic vector-quantized tokenizer. In particular, our E-ViLM learns to reconstruct the semantic labels of masked video regions, produced by the pre-trained vector-quantized tokenizer, which discretizes the continuous visual signals into labels. We show that with our simple MVM task and regular VL pre-training modelings, our E-ViLM, despite its compactness, is able to learn expressive representations from Video-Language corpora and generalize well to extensive Video-Language tasks including video question answering, text-to-video retrieval, etc. In particular, our E-ViLM obtains obvious efficiency improvements by reaching competitive performance with faster inference speed, i.e., our model reaches $39.3\%$ Top-$1$ accuracy on the MSRVTT benchmark, retaining $91.4\%$ of the accuracy of a state-of-the-art larger VL architecture with only $15\%$ of the parameters and $94.8\%$ fewer GFLOPs. We also provide extensive ablative studies that validate the effectiveness of our proposed learning schema for E-ViLM.
https://arxiv.org/abs/2311.17267
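A minimal sketch of the masked video modeling (MVM) objective described above: encoder outputs at masked patch positions are classified against the discrete codes assigned by the frozen vector-quantized tokenizer. The head design, feature dimension, codebook size, and masking ratio are assumptions.

```python
import torch
import torch.nn as nn

class MVMHead(nn.Module):
    """Predicts the vector-quantized code of masked video patches (sketch)."""
    def __init__(self, dim=384, codebook_size=8192):
        super().__init__()
        self.classifier = nn.Linear(dim, codebook_size)

    def forward(self, patch_feats, vq_labels, mask):
        # patch_feats: (B, L, dim) encoder outputs for the (masked) video patches
        # vq_labels:   (B, L) discrete codes from the frozen VQ tokenizer
        # mask:        (B, L) bool, True where the patch was masked out
        logits = self.classifier(patch_feats)                       # (B, L, codebook)
        return nn.functional.cross_entropy(logits[mask], vq_labels[mask])

head = MVMHead()
feats = torch.randn(2, 196, 384)
labels = torch.randint(0, 8192, (2, 196))
mask = torch.rand(2, 196) < 0.6        # 60% masking ratio (assumed)
print(head(feats, labels, mask).item())
```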
Large pre-trained vision models achieve impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively computationally expensive. Recent studies turn their focus towards efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and to the exploration of transferring a larger model to the video domain. In this paper, we present a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding, named Side4Video. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to reduce memory usage by 75% compared with previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B) for video understanding tasks, which is 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), especially on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at this https URL.
https://arxiv.org/abs/2311.15769
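A minimal sketch of the memory-saving idea behind a spatial-temporal side network: the pretrained backbone is frozen and run without gradient tracking, while a lightweight trainable side stream consumes its multi-level features, so backpropagation never touches the heavy model. The block design and the stand-in backbone are assumptions, not Side4Video's actual architecture.

```python
import torch
import torch.nn as nn

class TinySideBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
    def forward(self, side, skip):
        # Fuse the frozen backbone's feature (skip) into the side stream.
        return side + self.proj(self.norm(side + skip))

class SideNetwork(nn.Module):
    """Lightweight trainable stream fed by multi-level features of a frozen backbone."""
    def __init__(self, dim=768, num_levels=4):
        super().__init__()
        self.blocks = nn.ModuleList(TinySideBlock(dim) for _ in range(num_levels))
    def forward(self, level_feats):
        side = torch.zeros_like(level_feats[0])
        for blk, skip in zip(self.blocks, level_feats):
            side = blk(side, skip)
        return side.mean(dim=1)            # pooled video-level feature

frozen_backbone = nn.ModuleList(nn.Linear(768, 768) for _ in range(4))  # stand-in ViT
for p in frozen_backbone.parameters():
    p.requires_grad_(False)

x = torch.randn(2, 196, 768)               # patch tokens of a frame batch
with torch.no_grad():                       # no backprop through the heavy model
    feats, h = [], x
    for layer in frozen_backbone:
        h = layer(h)
        feats.append(h)

side = SideNetwork()
out = side([f.detach() for f in feats])     # only the side network gets gradients
print(out.shape)
```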
Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments, such as replaced entities, replaced actions, and flipped event order, which alignment models should be robust against. To this end, we introduce VideoCon, a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for the differences between original and contrast video captions. Then, a generative video-language model is finetuned with VideoCon to assess video-language entailment and generate explanations. Our VideoCon-based alignment model significantly outperforms current models. It exhibits a 12-point increase in AUC for the video-language alignment task on human-generated contrast captions. Finally, our model sets new state-of-the-art zero-shot performance in temporally-extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover, our model shows superior performance on novel videos and human-crafted captions and explanations. Our code and data are available at this https URL.
https://arxiv.org/abs/2311.10111
A recent trend in multimodal retrieval is postprocessing test set results via the dual-softmax loss (DSL). While this approach can bring significant improvements, it usually presumes that an entire matrix of test samples is available as DSL input. This work introduces a new postprocessing approach based on Sinkhorn transformations that outperforms DSL. Further, we propose a new postprocessing setting that does not require access to multiple test queries. We show that our approach can significantly improve the results of state-of-the-art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus achieving a new state of the art on several standard text-video retrieval datasets, both with access to the entire test set and in the single-query setting.
https://arxiv.org/abs/2311.08143
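For reference, here is how dual-softmax (DSL) postprocessing and a Sinkhorn-style alternative can be applied to a test-time text-video similarity matrix; the temperature and iteration count are illustrative assumptions, and the paper's single-query variant is not shown.

```python
import numpy as np

def dual_softmax(sim, temp=0.01):
    """DSL: multiply the row- and column-softmaxed similarity matrices."""
    a = np.exp(sim / temp)
    row = a / a.sum(axis=1, keepdims=True)
    col = a / a.sum(axis=0, keepdims=True)
    return row * col

def sinkhorn(sim, temp=0.01, n_iters=20):
    """Sinkhorn normalization: alternately normalize rows and columns."""
    k = np.exp(sim / temp)
    for _ in range(n_iters):
        k = k / k.sum(axis=1, keepdims=True)
        k = k / k.sum(axis=0, keepdims=True)
    return k

sim = np.random.randn(5, 5) * 0.1           # toy text-video similarity matrix
print(dual_softmax(sim).argmax(axis=1))
print(sinkhorn(sim).argmax(axis=1))
```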
Fashion stylists have historically bridged the gap between consumers' desires and perfect outfits, which involve intricate combinations of colors, patterns, and materials. Although recent advancements in fashion recommendation systems have made strides in outfit compatibility prediction and complementary item retrieval, these systems rely heavily on pre-selected customer choices. Therefore, we introduce a groundbreaking approach to fashion recommendation: a text-to-outfit retrieval task that generates a complete outfit set based solely on textual descriptions given by users. Our model is devised at three semantic levels (item, style, and outfit), where each level progressively aggregates data to form a coherent outfit recommendation based on textual input. Here, we leverage strategies similar to those in the contrastive language-image pretraining model to address the intricate style matrix within the outfit sets. Using the Maryland Polyvore and Polyvore Outfit datasets, our approach significantly outperformed state-of-the-art models in text-video retrieval tasks, solidifying its effectiveness in the fashion recommendation domain. This research not only pioneers a new facet of fashion recommendation systems, but also introduces a method that captures the essence of individual style preferences through textual descriptions.
https://arxiv.org/abs/2311.02122
Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text. The intricate and abundant context of the video challenges the performance and efficiency of TVR. To handle the serialized video contexts, existing methods typically select a subset of frames within a video to represent the video content for TVR. How to select the most representative frames is a crucial issue, whereby the selected frames are required not only to retain the semantic information of the video but also to promote retrieval efficiency by excluding temporally redundant frames. In this paper, we make the first empirical study of frame selection for TVR. We systematically classify existing frame selection methods into text-free and text-guided ones, under which we analyze six different frame selection methods in detail in terms of effectiveness and efficiency. Among them, two frame selection methods are first developed in this paper. According to a comprehensive analysis on multiple TVR benchmarks, we empirically conclude that TVR with proper frame selection can significantly improve retrieval efficiency without sacrificing retrieval performance.
https://arxiv.org/abs/2311.00298
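A minimal sketch of the two families of frame selection discussed above: a text-free uniform sampler and a text-guided selector that keeps the frames most similar to the query embedding. Feature dimensions and the top-k rule are assumptions rather than the paper's specific methods.

```python
import torch
import torch.nn.functional as F

def text_free_selection(frame_embs, k):
    """Uniformly sample k frames (text-free baseline)."""
    t = frame_embs.shape[0]
    idx = torch.linspace(0, t - 1, k).round().long()
    return frame_embs[idx], idx

def text_guided_selection(frame_embs, text_emb, k):
    """Keep the k frames most similar to the query text."""
    sims = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    idx = sims.topk(k).indices.sort().values     # keep temporal order
    return frame_embs[idx], idx

frames = torch.randn(64, 512)    # per-frame features of one video (assumed dims)
query = torch.randn(512)
_, kept = text_guided_selection(frames, query, k=8)
print(kept.tolist())
```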
Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain a massive number of visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
https://arxiv.org/abs/2310.19060
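A greedy stand-in for the temporal half of TESTA's token aggregation: the most similar adjacent frame tokens are repeatedly averaged until a target count remains (the same idea applies to patches within a frame). The real module learns the aggregation inside each encoder block; this simplification and the keep ratio are assumptions.

```python
import torch
import torch.nn.functional as F

def merge_similar_frames(frame_tokens, keep_ratio=0.5):
    """frame_tokens: (T, D) per-frame tokens.  Repeatedly average the most
    similar adjacent pair until only keep_ratio * T frames remain (greedy
    stand-in for learned temporal aggregation)."""
    tokens = list(frame_tokens)
    target = max(1, int(len(tokens) * keep_ratio))
    while len(tokens) > target:
        stacked = F.normalize(torch.stack(tokens), dim=-1)
        sims = (stacked[:-1] * stacked[1:]).sum(-1)     # similarity of neighbors
        i = int(sims.argmax())
        tokens[i:i + 2] = [(tokens[i] + tokens[i + 1]) / 2]
    return torch.stack(tokens)

video = torch.randn(32, 768)                 # 32 frame tokens
print(merge_similar_frames(video).shape)     # -> (16, 768)
```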
Compressing videos into binary codes can improve retrieval speed and reduce storage overhead. However, learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between video frames, especially in the absence of labels. Existing self-supervised video hashing methods have been effective in designing expressive temporal encoders, but have not fully utilized the temporal dynamics and spatial appearance of videos because their learning tasks are insufficiently challenging and unreliable. To address these challenges, we begin by utilizing a contrastive learning task to capture global spatio-temporal information of videos for hashing. With the aid of our designed augmentation strategies, which focus on spatial and temporal variations to create positive pairs, the learning framework can generate hash codes that are invariant to motion, scale, and viewpoint. Furthermore, we incorporate two collaborative learning tasks, i.e., frame order verification and scene change regularization, to capture local spatio-temporal details within video frames, thereby enhancing the perception of temporal structure and the modeling of spatio-temporal relationships. Our proposed Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets. Our codes will be released.
https://arxiv.org/abs/2310.18926
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. Recently, large language models (LLMs) have been used to enrich the text-based class labels by enhancing the descriptiveness of the class names. However, these improvements are restricted to the text-based classifier only, and the query visual features are not considered. In this paper, we propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models. We introduce two key modifications to the standard zero-shot setting. First, we propose language-guided visual feature enhancement and employ a video-to-text model to convert the query video to its descriptive form. The resulting descriptions contain vital visual cues of the query video, such as what objects are present and their spatio-temporal interactions. These descriptive cues provide additional semantic knowledge to VLMs to enhance their zero-shot performance. Second, we propose video-specific prompts to LLMs to generate more meaningful descriptions to enrich class label representations. Specifically, we introduce prompt techniques to create a Tree Hierarchy of Categories for class names, offering a higher-level action context for additional visual cues. We demonstrate the effectiveness of our approach in video understanding across three different zero-shot settings: 1) video action recognition, 2) video-to-text and text-to-video retrieval, and 3) time-sensitive video tasks. Consistent improvements across multiple benchmarks and with various VLMs demonstrate the effectiveness of our proposed framework. Our code will be made publicly available.
https://arxiv.org/abs/2310.15324
Unsupervised video hashing usually optimizes binary codes by learning to reconstruct input videos. Such a reconstruction constraint spends much effort on frame-level temporal context changes without focusing on video-level global semantics that are more useful for retrieval. Hence, we address this problem by decomposing video information into reconstruction-dependent and semantic-dependent information, which disentangles semantic extraction from the reconstruction constraint. Specifically, we first design a simple dual-stream structure, including a temporal layer and a hash layer. Then, with the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval, while the temporal layer learns to capture information for reconstruction. In this way, the model naturally preserves the disentangled semantics in the binary codes. Validated by comprehensive experiments, our method consistently outperforms the state of the art on three video benchmarks.
https://arxiv.org/abs/2310.08009
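A minimal sketch of the dual-stream idea above: a shared temporal encoder feeds a reconstruction head (reconstruction-dependent information) and a hash head whose output is binarized with a straight-through sign estimator (semantic-dependent information). The encoder choice, dimensions, and pooling are assumptions; the reconstruction and semantic-similarity losses themselves are omitted.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass, identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

class DualStreamHasher(nn.Module):
    """Sketch of a dual-stream design: one head reconstructs frame features
    (reconstruction-dependent info), the other produces binary codes for
    semantic retrieval (semantic-dependent info).  Dimensions are assumptions."""
    def __init__(self, dim=512, code_bits=64):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.temporal_head = nn.Linear(dim, dim)        # reconstruction stream
        self.hash_head = nn.Linear(dim, code_bits)      # hashing stream

    def forward(self, frames):                          # frames: (B, T, dim)
        h, _ = self.encoder(frames)
        recon = self.temporal_head(h)                   # trained with a reconstruction loss
        codes = BinarizeSTE.apply(self.hash_head(h.mean(dim=1)))  # (B, bits) in {-1, +1}
        return recon, codes

model = DualStreamHasher()
recon, codes = model(torch.randn(2, 16, 512))
print(recon.shape, codes.shape)
```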
Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead. To solve the efficiency problem of PRVR methods, this paper proposes GMMFormer, a \textbf{G}aussian-\textbf{M}ixture-\textbf{M}odel based Trans\textbf{former} which models clip representations implicitly. During frame interactions, we incorporate Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames instead of the whole video. The generated representations then contain multi-scale clip information, achieving implicit clip modeling. In addition, PRVR methods ignore semantic differences between text queries relevant to the same video, leading to a sparse embedding space. We propose a query diverse loss to distinguish these text queries, making the embedding space denser and richer in semantic information. Extensive experiments on three large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer.
https://arxiv.org/abs/2310.05195
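A minimal sketch of the Gaussian constraint described above: a Gaussian bias added to the frame-frame attention logits focuses each frame on its temporal neighborhood, and using several sigma values would yield the multi-scale clip information GMMFormer aggregates. Single-head attention and the sigma value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_window_attention(q, k, v, sigma=4.0):
    """Self-attention over frames with a Gaussian bias that focuses each
    frame on its temporal neighborhood (one scale of the mixture; the
    sigma value is an illustrative assumption)."""
    t, d = q.shape
    pos = torch.arange(t, dtype=torch.float32)
    dist2 = (pos[:, None] - pos[None, :]) ** 2
    bias = -dist2 / (2 * sigma ** 2)                    # log of a Gaussian window
    attn = F.softmax(q @ k.t() / d ** 0.5 + bias, dim=-1)
    return attn @ v

frames = torch.randn(30, 256)
out = gaussian_window_attention(frames, frames, frames, sigma=2.0)
print(out.shape)
```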
Despite the significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP into a strong zero-shot video classifier, capable of identifying novel actions and events during testing. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models to produce fine-grained video descriptions. These detailed descriptions are further aligned with video features, facilitating a better transfer of CLIP to the video domain. Our approach is evaluated on three widely used action recognition datasets, following a variety of zero-shot evaluation protocols. The results demonstrate that our method surpasses existing state-of-the-art techniques by significant margins. Specifically, we achieve zero-shot accuracy scores of 88.1%, 58.7%, and 81.2% on the UCF, HMDB, and Kinetics-600 datasets respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval dataset, where it delivers competitive video-to-text and text-to-video retrieval performance while utilizing substantially less fine-tuning data compared to other methods. Code is released at this https URL.
https://arxiv.org/abs/2310.05010
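A minimal sketch of weight interpolation as used in Interpolated Weight Optimization: parameters of the original and the video-fine-tuned model are linearly blended. The interpolation coefficient, when the blend is applied, and the stand-in modules are assumptions; the paper's full optimization procedure is more involved.

```python
import copy
import torch
import torch.nn as nn

def interpolate_weights(original: nn.Module, finetuned: nn.Module, alpha: float = 0.5):
    """Return a model whose parameters are (1 - alpha) * original + alpha * finetuned.
    The alpha value and per-parameter scheme are illustrative assumptions."""
    merged = copy.deepcopy(finetuned)
    with torch.no_grad():
        orig_sd, ft_sd = original.state_dict(), finetuned.state_dict()
        merged_sd = {k: (1 - alpha) * orig_sd[k] + alpha * ft_sd[k] for k in ft_sd}
        merged.load_state_dict(merged_sd)
    return merged

base = nn.Linear(512, 512)                  # stand-in for the pretrained CLIP encoder
tuned = copy.deepcopy(base)
nn.init.normal_(tuned.weight, std=0.02)     # pretend this was fine-tuned on videos
blend = interpolate_weights(base, tuned, alpha=0.3)
print(torch.allclose(blend.weight, 0.7 * base.weight + 0.3 * tuned.weight))
```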
Foundational multimodal models pre-trained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is often not feasible due to the difficulty of collecting large-scale clean and aligned data and the exponential computational costs involved in the pretraining phase. Therefore, the pertinent question to ask is: can image-text models be adapted to video tasks, and is there any benefit to using these models over pretraining directly on videos? In this work, we focus on this question by proposing a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC), and video captioning (video CP). Our experiments show that image-text models exhibit impressive performance on video AR, video RT, and video MC. Furthermore, they perform moderately on video captioning and poorly on video QA. These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
https://arxiv.org/abs/2310.04914
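A minimal sketch of the simplest way an image-text model is reused for zero-shot video retrieval in studies like the one above: per-frame embeddings are mean-pooled into a video embedding and ranked by cosine similarity against the text query. The stand-in encoder, frame count, and pooling choice are assumptions.

```python
import torch
import torch.nn.functional as F

def video_embedding(image_encoder, frames):
    """frames: (T, 3, H, W).  Encode each frame and mean-pool: the simplest
    way to reuse an image-text model for zero-shot video retrieval."""
    with torch.no_grad():
        feats = image_encoder(frames)                    # (T, D)
    return F.normalize(feats.mean(dim=0), dim=-1)        # (D,)

def rank_videos(text_emb, video_embs):
    sims = video_embs @ F.normalize(text_emb, dim=-1)
    return sims.argsort(descending=True)

# Stand-in encoder; in practice a pretrained image-text model would be used.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 512))
videos = [torch.randn(8, 3, 224, 224) for _ in range(3)]
video_embs = torch.stack([video_embedding(encoder, v) for v in videos])
print(rank_videos(torch.randn(512), video_embs).tolist())
```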
Instructional videos are an excellent source for learning multimodal representations by leveraging video-subtitle pairs extracted with automatic speech recognition (ASR) systems from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision for multimodal learning. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capability of large language models (LLMs) to obtain fine-grained video descriptions aligned with videos. Specifically, we prompt an LLM to create plausible video descriptions based on ASR narrations of the video for a large-scale instructional video dataset. To this end, we introduce a prompting method that is able to take into account longer texts of subtitles, allowing us to capture context beyond a single sentence. To align the captions to the video temporally, we prompt the LLM to generate timestamps for each produced caption based on the subtitles. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve performance over many different benchmark datasets for text-video retrieval, but also lead to a disentangling of textual narration from the audio, boosting performance on text-video-audio tasks.
https://arxiv.org/abs/2310.04900
With the explosive growth of web videos in recent years, large-scale Content-Based Video Retrieval (CBVR) becomes increasingly essential in video filtering, recommendation, and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end times of similar segments at finer granularity, which is beneficial for user browsing efficiency and infringement detection, especially in long video scenarios. The challenge of the S-CBVR task is how to achieve high temporal alignment accuracy with efficient computation and low storage consumption. In this paper, we propose a Segment Similarity and Alignment Network (SSAN) to deal with this challenge, which is the first to be trained end-to-end for S-CBVR. SSAN is based on two newly proposed modules in video retrieval: (1) an efficient Self-supervised Keyframe Extraction (SKE) module to reduce redundant frame features, and (2) a robust Similarity Pattern Detection (SPD) module for temporal alignment. In comparison with uniform frame extraction, SKE not only saves feature storage and search time, but also achieves comparable accuracy with limited extra computation time. In terms of temporal alignment, SPD localizes similar segments with higher accuracy and efficiency than existing deep learning methods. Furthermore, we jointly train SSAN with SKE and SPD and achieve an end-to-end improvement. Meanwhile, the two key modules SKE and SPD can also be effectively inserted into other video retrieval pipelines and gain considerable performance improvements. Experimental results on public datasets show that SSAN can obtain higher alignment accuracy while saving storage and online query computational cost compared to existing methods.
https://arxiv.org/abs/2309.11091
In recent years, the explosion of web videos has made text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant text/video higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning using two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively identify all these hard negatives and explicitly highlight their impact in the training loss. Second, our work argues that triplet samples can better model fine-grained semantic similarity compared to pairwise samples. We thereby present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely-used text-video retrieval datasets, including MSR-VTT, MSVD, DiDeMo, and ActivityNet.
https://arxiv.org/abs/2309.11082
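To illustrate the hard-negative idea, here is a negative-aware variant of InfoNCE in which negatives whose similarity comes within a margin of the positive are explicitly up-weighted inside the softmax. This is a simplified stand-in, not the paper's exact NegNCE or TPM-CL formulation; the margin, weight, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def negative_aware_info_nce(text_emb, video_emb, temp=0.05, hard_weight=2.0, margin=0.1):
    """InfoNCE over a batch where negatives whose similarity comes within
    `margin` of the positive are explicitly up-weighted (illustrative
    stand-in for a negative-aware loss)."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    sim = t @ v.t()                                       # (B, B), diagonal = positives
    pos = sim.diag().unsqueeze(1)
    is_hard = (sim > pos - margin) & ~torch.eye(len(sim), dtype=torch.bool)
    weights = torch.where(is_hard, torch.full_like(sim, hard_weight), torch.ones_like(sim))
    logits = sim / temp + weights.log()                   # up-weighting inside the softmax
    labels = torch.arange(len(sim))
    return F.cross_entropy(logits, labels)

loss = negative_aware_info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```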