The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video for a text query is often challenging, as it requires reasoning about both high-level (scene) and low-level (object) visual clues and how they relate to the query. To this end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA. Specifically, our model captures cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual clues, we also apply an Interactive Similarity Aggregation (ISA) module that weighs the importance of different visual features while aggregating the cross-modal similarities into a single score per granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them, alleviating over- and under-representation issues across levels. By jointly considering cross-modal similarities at different granularities, UCoFiA effectively unifies multi-grained alignments. Empirically, UCoFiA outperforms previous state-of-the-art CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%, 1.4%, and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, Activity-Net, and DiDeMo, respectively. Our code is publicly available at this https URL.
https://arxiv.org/abs/2309.10091
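The UCoFiA abstract names Sinkhorn-Knopp normalization as the step that balances per-level similarity scores but gives no implementation details. The following is a minimal sketch of that step under assumed uniform marginals and illustrative function names, not the authors' code: each granularity level's text-video similarity matrix is rescaled so its rows and columns approach uniform sums before the levels are summed.

```python
import numpy as np

def sinkhorn_normalize(sim, n_iters=20, eps=0.1):
    """Sinkhorn-Knopp normalization of one level's text-video similarity matrix.

    sim: (N_text, N_video) raw similarity scores for a single granularity level.
    Returns a matrix whose row and column sums are pushed toward uniform
    marginals, so no single video is over- or under-represented before the
    per-level scores are summed.
    """
    K = np.exp(sim / eps)                      # positive kernel
    n, m = K.shape
    r = np.full(n, 1.0 / n)                    # target (uniform) row marginals
    c = np.full(m, 1.0 / m)                    # target (uniform) column marginals
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)                        # rescale rows
        v = c / (K.T @ u)                      # rescale columns
    return np.diag(u) @ K @ np.diag(v)

# Toy usage: normalize each granularity level independently, then sum.
levels = [np.random.randn(4, 6) for _ in range(3)]   # e.g. video-, frame-, patch-level scores
fused = sum(sinkhorn_normalize(s) for s in levels)
best_video_per_query = fused.argmax(axis=1)
```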
Many studies focus on improving pretraining or developing new backbones in text-video retrieval. However, existing methods may suffer from learning and inference bias, as recent research on other text-video tasks suggests. For instance, spatial appearance features in action recognition or temporal object co-occurrences in video scene graph generation can induce spurious correlations. In this work, we present a systematic study of a temporal bias caused by the frame-length discrepancy between the training and test sets of trimmed video clips, which, to the best of our knowledge, is the first such attempt for a text-video retrieval task. We first hypothesise the bias and verify how it affects the model through a baseline study. Then, we propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets. Our model surpasses the baseline and the SOTA on nDCG, a semantic-relevance-focused evaluation metric, which shows the bias is mitigated, as well as on the other conventional metrics.
https://arxiv.org/abs/2309.09311
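This entry's headline metric is nDCG, a ranking metric that rewards placing semantically relevant videos near the top. A minimal sketch of how it is computed is given below; the relevance grades in the toy example are made up for illustration and do not reflect any dataset's actual grading scheme.

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(relevances) + 2))   # log2(2), log2(3), ...
    return float(np.sum((2 ** relevances - 1) / discounts))

def ndcg(ranked_relevances):
    """nDCG: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: relevance of the top-5 retrieved videos for one text query,
# where 2 = exact match, 1 = semantically related, 0 = irrelevant.
print(ndcg([1, 2, 0, 1, 0]))   # imperfect ranking -> nDCG < 1.0
print(ndcg([2, 1, 1, 0, 0]))   # ideal ordering -> nDCG = 1.0
```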
Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setting, text-video retrieval with uncurated & unpaired data, which during training uses only text queries together with uncurated web videos and no paired text-video data. To this end, we propose an approach, In-Style, that learns the style of the text queries and transfers it to uncurated web videos. Moreover, to improve generalization, we show that one model can be trained with multiple text styles. For this purpose, we introduce a multi-style contrastive training procedure that improves generalizability over several datasets simultaneously. We evaluate retrieval performance on multiple datasets to demonstrate the advantages of our style-transfer framework on the new task of uncurated & unpaired text-video retrieval, and we improve state-of-the-art performance on zero-shot text-video retrieval.
https://arxiv.org/abs/2309.08928
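The In-Style abstract relies on contrastive training between text queries and (style-transferred) web videos. The sketch below shows a standard symmetric InfoNCE loss of the kind such methods typically use; the per-style batching at the end is only an assumed reading of "multi-style contrastive training", not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss between L2-normalized text and video embeddings.

    text_emb, video_emb: (B, D) tensors; row i of each is a matched pair
    (here: a text query and the web video paired with it in-style).
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2v = F.cross_entropy(logits, targets)              # text -> video
    loss_v2t = F.cross_entropy(logits.t(), targets)          # video -> text
    return 0.5 * (loss_t2v + loss_v2t)

# Assumed multi-style variant: sum the loss over per-style batches.
batches_by_style = {"msrvtt-style": (torch.randn(8, 512), torch.randn(8, 512)),
                    "youcook-style": (torch.randn(8, 512), torch.randn(8, 512))}
total_loss = sum(contrastive_loss(t, v) for t, v in batches_by_style.values())
```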
Optimizing video inference efficiency has become increasingly important with the growing demand for video analysis in various fields. Some existing methods achieve high efficiency by explicitly discarding spatial or temporal information, which poses challenges in fast-changing and fine-grained scenarios. To address these issues, we propose an efficient video representation network with a Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information in the early stage of the network to reduce computational costs while maintaining consistent temporal correlations. Specifically, we leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features, refining and updating the features into a high-low resolution video sequence. To process the new sequence, we introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions, while reducing spatial computation costs quadratically by utilizing fewer spatial tokens in low-resolution non-saliency frames. The entire network can be end-to-end optimized via the integration of the differentiable compression module. Experimental results show that our method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared to state-of-the-art methods. Code: this https URL
https://arxiv.org/abs/2309.08167
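To make the resolution-compression idea above concrete, here is a deliberately crude stand-in: a hard saliency mask keeps some frames at full resolution and downsamples the rest, which is where the quadratic token savings come from. The actual module in the paper is learned and differentiable; the hard selection, downsampling choice, and patch size below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def compress_non_salient_frames(frames, saliency, low_res_factor=2):
    """Frames judged salient keep full resolution; the rest are spatially
    downsampled, so they later yield low_res_factor**2 times fewer patch tokens.

    frames:   (T, C, H, W) video clip
    saliency: (T,) boolean mask over frames
    Returns a list of per-frame tensors with mixed resolutions.
    """
    compressed = []
    for frame, keep_full in zip(frames, saliency):
        if keep_full:
            compressed.append(frame)
        else:
            h, w = frame.shape[-2:]
            small = F.interpolate(frame.unsqueeze(0),
                                  size=(h // low_res_factor, w // low_res_factor),
                                  mode="bilinear", align_corners=False)
            compressed.append(small.squeeze(0))
    return compressed

frames = torch.randn(8, 3, 224, 224)
saliency = torch.tensor([1, 0, 0, 1, 0, 0, 0, 1], dtype=torch.bool)
mixed = compress_non_salient_frames(frames, saliency)
patch = 16
tokens_per_frame = [(f.shape[-2] // patch) * (f.shape[-1] // patch) for f in mixed]
# Salient frames -> 196 tokens each; compressed frames -> 49 tokens each.
```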
This paper presents a novel semi-supervised deep learning algorithm for retrieving similar 2D and 3D videos based on visual content. The proposed approach combines the power of deep convolutional and recurrent neural networks with dynamic time warping as a similarity measure. The proposed algorithm is designed to handle large video datasets and retrieve the most related videos to a given inquiry video clip based on its graphical frames and contents. We split both the candidate and the inquiry videos into a sequence of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network. We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method. This approach is tested on multiple public datasets, including CC_WEB_VIDEO, Youtube-8m, S3DIS, and Synthia, and showed good results compared to the state of the art. The algorithm effectively solves video retrieval tasks and outperforms the benchmarked state-of-the-art deep learning model.
https://arxiv.org/abs/2309.01063
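The core similarity measure in this entry is dynamic time warping over sequences of clip embeddings. A minimal DTW sketch is shown below, using cosine distance as the local cost; the bi-directional variant mentioned in the abstract is not reproduced, and the embedding dimensions are illustrative.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of clip embeddings.

    seq_a: (Ta, D), seq_b: (Tb, D) per-clip embedding vectors. The local cost
    is cosine distance; the DP finds the cheapest monotonic alignment between
    the two clip sequences.
    """
    a = seq_a / np.linalg.norm(seq_a, axis=1, keepdims=True)
    b = seq_b / np.linalg.norm(seq_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                        # (Ta, Tb) pairwise cosine distances
    Ta, Tb = cost.shape
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],      # stretch seq_b
                                               D[i, j - 1],      # stretch seq_a
                                               D[i - 1, j - 1])  # match both
    return D[Ta, Tb]

query_clips = np.random.randn(6, 128)           # per-clip autoencoder embeddings
candidate_clips = np.random.randn(9, 128)
score = dtw_distance(query_clips, candidate_clips)   # lower = more similar
```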
In this work, we present an approach to identify sub-tasks within a demonstrated robot trajectory using language instructions. We identify these sub-tasks using language provided during demonstrations as guidance to identify sub-segments of a longer robot trajectory. Given a sequence of natural language instructions and a long trajectory consisting of image frames and discrete actions, we want to map an instruction to a smaller fragment of the trajectory. Unlike previous instruction following works which directly learn the mapping from language to a policy, we propose a language-conditioned change-point detection method to identify sub-tasks in a problem. Our approach learns the relationship between constituent segments of a long language command and corresponding constituent segments of a trajectory. These constituent trajectory segments can be used to learn subtasks or sub-goals for planning or options as demonstrated by previous related work. Our insight in this work is that the language-conditioned robot change-point detection problem is similar to the existing video moment retrieval works used to identify sub-segments within online videos. Through extensive experimentation, we demonstrate a $1.78_{\pm 0.82}\%$ improvement over a baseline approach in accurately identifying sub-tasks within a trajectory using our proposed method. Moreover, we present a comprehensive study investigating sample complexity requirements on learning this mapping, between language and trajectory sub-segments, to understand if the video retrieval-based methods are realistic in real robot scenarios.
https://arxiv.org/abs/2309.00743
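The abstract draws an analogy between language-conditioned change-point detection and video moment retrieval: map one instruction to the best-matching sub-segment of a long trajectory. The sketch below illustrates that mapping with a simple sliding-window similarity search; it is not the authors' change-point model, and the embeddings, window bounds, and scoring are all assumptions.

```python
import numpy as np

def best_segment(step_embs, instruction_emb, min_len=3, max_len=15):
    """Moment-retrieval-style stand-in: score every candidate window by the
    cosine similarity between its mean step embedding and the instruction.

    step_embs: (T, D) per-timestep embeddings of the demonstration
               (e.g. fused image-frame + action features).
    instruction_emb: (D,) embedding of one natural-language instruction.
    Returns (start, end) indices of the highest-scoring window.
    """
    q = instruction_emb / np.linalg.norm(instruction_emb)
    T = len(step_embs)
    best, best_score = (0, min_len), -np.inf
    for start in range(T):
        for end in range(start + min_len, min(start + max_len, T) + 1):
            seg = step_embs[start:end].mean(axis=0)
            score = float(seg @ q / np.linalg.norm(seg))
            if score > best_score:
                best, best_score = (start, end), score
    return best

steps = np.random.randn(40, 256)                # trajectory step embeddings
instr = np.random.randn(256)                    # instruction embedding
start, end = best_segment(steps, instr)         # candidate sub-task boundaries
```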
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at this https URL.
https://arxiv.org/abs/2308.14746
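The dataset-construction recipe in this entry starts by mining pairs of videos whose captions are highly similar; an LLM then turns the caption difference into a modification text (that step is not shown). A toy version of the mining step, under assumed sentence embeddings and an arbitrary similarity threshold, might look like this:

```python
import numpy as np

def mine_caption_pairs(caption_embs, threshold=0.8, top_k=1):
    """Find, for every captioned video, other videos whose caption embedding is
    highly similar, yielding (source, target) seed pairs for triplet creation.

    caption_embs: (N, D) sentence embeddings of the video captions.
    Returns a list of (i, j) index pairs with cosine similarity above threshold.
    """
    x = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -1.0)               # a video is never paired with itself
    pairs = []
    for i in range(len(sims)):
        for j in np.argsort(-sims[i])[:top_k]:
            if sims[i, j] >= threshold:
                pairs.append((i, int(j)))
    return pairs

caption_embs = np.random.randn(100, 384)        # e.g. sentence-encoder outputs
triplet_seeds = mine_caption_pairs(caption_embs)
```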
To date, the majority of video retrieval systems have been optimized for a "single-shot" scenario in which the user submits a query in isolation, ignoring previous interactions with the system. Recently, there has been renewed interest in interactive systems to enhance retrieval, but existing approaches are complex and deliver limited gains in performance. In this work, we revisit this topic and propose several simple yet effective baselines for interactive video retrieval via question-answering. We employ a VideoQA model to simulate user interactions and show that this enables the productive study of the interactive retrieval task without access to ground truth dialogue data. Experiments on MSR-VTT, MSVD, and AVSD show that our framework using question-based interaction significantly improves the performance of text-based video retrieval systems.
https://arxiv.org/abs/2308.10402
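One simple baseline consistent with this abstract is: retrieve, ask a question, let a VideoQA model (standing in for the user) answer, append the answer to the query, and re-rank. The loop below sketches that flow; the text encoder and answer function are placeholder stand-ins, not the paper's models.

```python
import numpy as np

def retrieve(query_emb, video_embs, top_k=5):
    """Rank videos by cosine similarity to the current query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return np.argsort(-(v @ q))[:top_k]

def interactive_retrieval(encode_text, answer_question, video_embs, query,
                          questions, top_k=5):
    """After each round, the simulated user's answer is appended to the text
    query and the ranking is recomputed."""
    text = query
    ranking = retrieve(encode_text(text), video_embs, top_k)
    for question in questions:
        text = text + " " + answer_question(question)      # grow the query
        ranking = retrieve(encode_text(text), video_embs, top_k)
    return ranking

# Toy stand-ins: a random "text encoder" and a fixed VideoQA-style answer.
rng = np.random.default_rng(0)
encode_text = lambda s: rng.standard_normal(256)
answer_question = lambda q: "a man is cooking pasta in a kitchen"
video_embs = rng.standard_normal((50, 256))
ranking = interactive_retrieval(encode_text, answer_question, video_embs,
                                "someone cooking", ["what is being cooked?"])
```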
In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for them is how to effectively capture the rich semantics inside the video using the image encoder of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse the text information into video frame representations, which, however, incurs severe efficiency issues in large-scale retrieval systems as the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switch it within the encoder layers to efficiently incorporate the global video semantics into frame representations. We then propose to apply an auxiliary video captioning objective to train the frame representations, which facilitates the learning of detailed video semantics by providing fine-grained guidance in the semantic space. With a naive temporal fusion strategy (i.e., mean-pooling) on the enhanced frame representations, we obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
https://arxiv.org/abs/2308.07648
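The efficiency argument in this abstract hinges on a detail worth making explicit: with text-independent video representations and a naive mean-pooling fusion, video embeddings can be computed once offline and every new query reduces to one matrix multiplication. A minimal sketch of that retrieval path (not the Prompt Cube or captioning objective themselves) follows; dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def video_embedding(frame_features):
    """Temporal fusion by plain mean-pooling of per-frame features.

    frame_features: (T, D) frame representations from the image encoder.
    """
    return F.normalize(frame_features.mean(dim=0), dim=-1)

def t2v_scores(text_features, video_bank):
    """Text-to-video similarity against a bank of pre-computed video embeddings.

    text_features: (Q, D); video_bank: (N, D).
    """
    return F.normalize(text_features, dim=-1) @ video_bank.t()

# Offline: pool per-frame features for each video once and store them.
video_bank = torch.stack([video_embedding(torch.randn(12, 512)) for _ in range(100)])
# Online: a new batch of queries needs only one matrix multiplication.
scores = t2v_scores(torch.randn(3, 512), video_bank)    # (3, 100)
```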
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods are dominating. Compared to CLIP4Clip which is efficient and compact, the state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR into doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching to let a CLIP4Clip-based student network learn from more advanced yet computationally heavy models such as X-CLIP, TS2-Net, and X-Pool. To improve the student's learning capability, we add an Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage/computation overhead at the retrieval stage. While attentive weights produced by AFA are commonly used for combining frame-level features, we propose a novel use of the weights to let them imitate frame-text relevance estimated by the teacher network. As such, AFA provides a fine-grained learning (teaching) channel for the student (teacher). Extensive experiments on multiple public datasets justify the viability of the proposed method.
https://arxiv.org/abs/2308.01217
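An AFA-style block as described above has two roles: its per-frame weights aggregate frame features into a video feature, and the same weights can be trained to imitate the teacher's frame-text relevance. The sketch below illustrates both roles under assumed layer sizes and an assumed KL-based imitation loss; it is a reading of the abstract, not the released TeachCLIP code.

```python
import torch
import torch.nn.functional as F

class AttentionalFrameAggregation(torch.nn.Module):
    """A tiny scorer produces one weight per frame; the weights combine frame
    features into a video feature and can also be supervised by a teacher."""

    def __init__(self, dim=512):
        super().__init__()
        self.scorer = torch.nn.Linear(dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) -> weights: (B, T), video_feat: (B, D)
        weights = self.scorer(frame_feats).squeeze(-1).softmax(dim=-1)
        video_feat = (weights.unsqueeze(-1) * frame_feats).sum(dim=1)
        return video_feat, weights

def teaching_loss(student_weights, teacher_frame_text_relevance):
    """Fine-grained distillation (assumed form): match the student's frame
    weights to the teacher's softmax-normalized frame-text relevance."""
    target = teacher_frame_text_relevance.softmax(dim=-1)
    return F.kl_div(student_weights.clamp_min(1e-8).log(), target,
                    reduction="batchmean")

afa = AttentionalFrameAggregation()
video_feat, w = afa(torch.randn(4, 12, 512))
loss = teaching_loss(w, torch.randn(4, 12))     # teacher scores, e.g. from X-CLIP
```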
Recent advancements in surgical computer vision applications have been driven by fully-supervised methods, primarily using only visual data. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively show the representation capability of the learned joint latent space, we introduce several vision-and-language tasks for surgery, such as text-based video retrieval, temporal activity grounding, and video captioning, as benchmarks for evaluation. We further demonstrate that without using any labeled ground truth, our approach can be employed for traditional vision-only surgical downstream tasks, such as surgical tool, phase, and triplet recognition. The code will be made available at this https URL
https://arxiv.org/abs/2307.15220
Text-to-video retrieval systems have recently made significant progress by utilizing pre-trained models trained on large-scale image-text pairs. However, most of the latest methods primarily focus on the video modality while disregarding the audio signal for this task. Nevertheless, a recent advancement by ECLIPSE has improved long-range text-to-video retrieval by developing an audiovisual video representation. Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment. To address this issue, we introduce TEFAL, a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query. Instead of using only an audiovisual attention block, which could suppress the audio information relevant to the text query, our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately. Our proposed method's efficacy is demonstrated on four benchmark datasets that include audio: MSR-VTT, LSMDC, VATEX, and Charades, and achieves better than state-of-the-art performance consistently across the four datasets. This is attributed to the additional text-query-conditioned audio representation and the complementary information it adds to the text-query-conditioned video representation.
https://arxiv.org/abs/2307.12964
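The key architectural point in the TEFAL abstract is that the text attends to audio and to video through two independent cross-modal attention blocks, rather than through a single audiovisual block. The sketch below illustrates that idea with off-the-shelf multi-head attention; the class name, dimensions, and final fusion comment are assumptions, not the paper's exact design.

```python
import torch

class TextConditionedPooling(torch.nn.Module):
    """The text query attends over a sequence of modality features (audio or
    video frames) to produce a text-conditioned summary of that modality."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feat, modality_feats):
        # text_feat: (B, D) query; modality_feats: (B, T, D) keys/values.
        pooled, _ = self.attn(text_feat.unsqueeze(1), modality_feats, modality_feats)
        return pooled.squeeze(1)                         # (B, D)

# Two independent blocks: one for audio, one for video.
audio_pool = TextConditionedPooling()
video_pool = TextConditionedPooling()

text = torch.randn(2, 512)
audio_feats, video_feats = torch.randn(2, 30, 512), torch.randn(2, 12, 512)
text_cond_audio = audio_pool(text, audio_feats)          # text attends to audio only
text_cond_video = video_pool(text, video_feats)          # text attends to video only
# A retrieval score can then combine both text-conditioned representations,
# e.g. by summing their similarities with the text embedding.
```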
Video anomaly detection (VAD) has received increasing attention due to its potential applications. Its currently dominant tasks focus on detecting anomalies online at the frame level, which can be roughly interpreted as binary or multi-class event classification. However, such a setup, which relates complicated anomalous events to single labels, e.g., "vandalism", is superficial, since single labels cannot adequately characterize anomalous events. In reality, users tend to search for a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and positive, but little research has focused on it. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos by cross-modalities, e.g., language descriptions and synchronous audios. Unlike current video retrieval, where videos are assumed to be temporally well-trimmed and of short duration, VAR is devised to retrieve long untrimmed videos that may be only partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose an anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between video-text fine-grained representations. Besides, we leverage two complementary alignments to further match cross-modal contents. Experimental results on two benchmarks reveal the challenges of the VAR task and also demonstrate the advantages of our tailored method.
https://arxiv.org/abs/2307.12545
State-of-the-art text-video retrieval (TVR) methods typically utilize CLIP and cosine similarity for efficient retrieval. Meanwhile, cross attention methods, which employ a transformer decoder to compute attention between each text query and all frames in a video, offer a more comprehensive interaction between text and videos. However, these methods lack important fine-grained spatial information as they directly compute attention between text and video-level tokens. To address this issue, we propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions. Additionally, we employ the frozen CLIP model strategy in fine-grained retrieval, enabling scalability to larger pre-trained vision models like ViT-G, resulting in improved retrieval performance. Experiments on text video retrieval datasets demonstrate the effectiveness and scalability of our proposed CrossTVR compared to state-of-the-art approaches.
https://arxiv.org/abs/2307.09972
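The two-stage structure described in the CrossTVR abstract — cheap cosine-similarity candidate selection followed by an expensive cross-attention re-ranker over only the shortlist — is easy to sketch generically. The pipeline below illustrates that pattern; the re-ranking callable is a placeholder, not the paper's decoupled cross-attention module.

```python
import torch
import torch.nn.functional as F

def two_stage_retrieval(text_emb, video_embs, rerank_fn, top_k=20):
    """Generic two-stage retrieval: coarse shortlist, then fine re-ranking.

    text_emb:   (D,) query embedding from the first-stage model.
    video_embs: (N, D) pre-computed first-stage video embeddings.
    rerank_fn:  callable(candidate_indices) -> (top_k,) fine-grained scores;
                stands in for a heavier cross-attention scorer.
    """
    coarse = F.normalize(video_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    candidates = coarse.topk(top_k).indices            # stage 1: cheap shortlist
    fine_scores = rerank_fn(candidates)                # stage 2: expensive scoring
    order = fine_scores.argsort(descending=True)
    return candidates[order]

# Toy usage with a placeholder re-ranker.
video_embs = torch.randn(1000, 512)
text_emb = torch.randn(512)
rerank_fn = lambda idx: torch.randn(len(idx))           # stands in for cross attention
ranking = two_stage_retrieval(text_emb, video_embs, rerank_fn)
```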
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLM), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
https://arxiv.org/abs/2307.06942
Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this by developing a framework comprised of two functional modules: (i) Motion Structure Retrieval, which provides video candidates with desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters. The videos are synthesized by following the structural guidance and appearance instruction. To ensure visual consistency across clips, we propose an effective concept personalization approach, which allows the specification of the desired character identities through text prompts. Extensive experiments demonstrate that our approach exhibits significant advantages over various existing baselines.
https://arxiv.org/abs/2307.06940
Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.
https://arxiv.org/abs/2307.03153
Video retrieval (VR) involves retrieving the ground truth video from the video database given a text caption or vice-versa. The two important components of compositionality, objects & attributes and actions, are joined using correct semantics to form a proper text query. These components (objects & attributes, actions, and semantics) each play an important role in distinguishing among videos and retrieving the correct ground truth video. However, it is unclear what effect these components have on video retrieval performance. We therefore conduct a systematic study to evaluate the compositional and semantic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD, and DIDEMO. The study is performed on two categories of video retrieval models: (i) those pre-trained on video-text pairs and fine-tuned on downstream video retrieval datasets (e.g., Frozen-in-Time, Violet, MCQ, etc.), and (ii) those that adapt pre-trained image-text representations like CLIP for video retrieval (e.g., CLIP4Clip, XCLIP, CLIP2Video, etc.). Our experiments reveal that actions and semantics play a minor role compared to objects & attributes in video understanding. Moreover, video retrieval models that use pre-trained image-text representations (CLIP) have better semantic and compositional understanding than models pre-trained on video-text data.
https://arxiv.org/abs/2306.16533
The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis and retrieval evaluation with the goal of promoting progress in research and development of content-based exploitation and retrieval of information from digital video via open, tasks-based evaluation supported by metrology. Over the last twenty-one years this effort has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. TRECVID has been funded by NIST (National Institute of Standards and Technology) and other US government agencies. In addition, many organizations and individuals worldwide contribute significant time and effort. TRECVID 2022 planned for the following six tasks: Ad-hoc video search, Video to text captioning, Disaster scene description and indexing, Activity in extended videos, deep video understanding, and movie summarization. In total, 35 teams from various research organizations worldwide signed up to join the evaluation campaign this year. This paper introduces the tasks, datasets used, evaluation frameworks and metrics, as well as a high-level results overview.
https://arxiv.org/abs/2306.13118
Multimodal learning on video and text data has been receiving growing attention from many researchers in various research tasks, including text-to-video retrieval, video-to-text retrieval, and video captioning. Although many algorithms have been proposed for those challenging tasks, most of them are developed on English language datasets. Despite Indonesian being one of the most spoken languages in the world, the research progress on the multimodal video-text with Indonesian sentences is still under-explored, likely due to the absence of the public benchmark dataset. To address this issue, we construct the first public Indonesian video-text dataset by translating English sentences from the MSVD dataset to Indonesian sentences. Using our dataset, we then train neural network models which were developed for the English video-text dataset on three tasks, i.e., text-to-video retrieval, video-to-text retrieval, and video captioning. The recent neural network-based approaches to video-text tasks often utilized a feature extractor that is primarily pretrained on an English vision-language dataset. Since the availability of the pretraining resources with Indonesian sentences is relatively limited, the applicability of those approaches to our dataset is still questionable. To overcome the lack of pretraining resources, we apply cross-lingual transfer learning by utilizing the feature extractors pretrained on the English dataset, and we then fine-tune the models on our Indonesian dataset. Our experimental results show that this approach can help to improve the performance for the three tasks on all metrics. Finally, we discuss potential future works using our dataset, inspiring further research in the Indonesian multimodal video-text tasks. We believe that our dataset and our experimental results could provide valuable contributions to the community. Our dataset is available on GitHub.
https://arxiv.org/abs/2306.11341