Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy of minimizing a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid in learning the target task via auxiliary learning. We formulate auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet, and All-in-one) and show significant performance gains on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at this https URL.
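The bi-level setup can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration (not the authors' released code): a small network combines the per-task losses non-linearly, the model takes a one-step look-ahead update on the combined loss, and the combiner is then updated on the target-task validation loss. This one-step, first-order scheme stands in for the paper's full AID-based solver; `train_losses_fn` and `val_loss_fn` are assumed callables.

```python
import torch
import torch.nn as nn

class LossCombiner(nn.Module):
    """Maps a vector of K per-task loss values to a single combined scalar loss."""
    def __init__(self, num_losses: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_losses, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keep the combined loss non-negative
        )

    def forward(self, losses: torch.Tensor) -> torch.Tensor:
        return self.net(losses).squeeze(-1)

def meta_step(params, combiner, train_losses_fn, val_loss_fn, opt_combiner, lr_inner=1e-3):
    """One bi-level iteration using a differentiable one-step look-ahead (a first-order
    stand-in for AID). `params` is a list of model tensors with requires_grad=True."""
    losses = torch.stack(train_losses_fn(params))                  # (K,) per-task training losses
    combined = combiner(losses)
    grads = torch.autograd.grad(combined, params, create_graph=True)
    lookahead = [p - lr_inner * g for p, g in zip(params, grads)]  # inner SGD step, kept in the graph
    val_loss = val_loss_fn(lookahead)                              # target-task loss on held-out data
    opt_combiner.zero_grad()
    val_loss.backward()                                            # gradient flows into the combiner
    opt_combiner.step()
    return [p.detach().requires_grad_(True) for p in lookahead]    # commit the inner update
```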
https://arxiv.org/abs/2303.13009
Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve performance for real-world applications, mainly due to the long-tail words challenge. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. Notably, TextKG is a two-stream transformer formed by an external stream and an internal stream. The external stream is designed to absorb additional knowledge; it models the interactions between that knowledge, e.g., a pre-built knowledge graph, and the built-in information of videos, e.g., the salient object regions, speech transcripts, and video captions, to mitigate the long-tail words challenge. Meanwhile, the internal stream is designed to exploit the multi-modal information in videos (e.g., the appearance of video frames, speech transcripts, and video captions) to ensure the quality of the caption results. In addition, a cross-attention mechanism is used between the two streams to share information, so the two streams can help each other produce more accurate results. Extensive experiments conducted on four challenging video captioning datasets, i.e., YouCookII, ActivityNet Captions, MSRVTT, and MSVD, demonstrate that the proposed method performs favorably against state-of-the-art methods. Specifically, the proposed TextKG method outperforms the best published results on the YouCookII dataset by 18.7 absolute CIDEr points.
https://arxiv.org/abs/2303.12423
Sequential video understanding, as an emerging video understanding task, has drawn considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss, where a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As the frame-sentence correspondence is not available, we propose to use the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences and supervise the network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at this https URL
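As a concrete picture of the two granularities, here is a minimal, hypothetical sketch (not the released code): a symmetric InfoNCE term for video-paragraph matching plus a frame-sentence term whose targets are pseudo labels obtained purely from temporal order, assuming frames are split into as many contiguous chunks as there are sentences. Embedding dimensions and the temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two sets of paired embeddings of shape (N, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def multi_granularity_loss(video_emb, para_emb, frame_emb, sent_emb, tau=0.07):
    # video_emb, para_emb: (B, D) pooled video / full-script embeddings for a batch
    # frame_emb: (T, D) frames of one video; sent_emb: (S, D) its action sentences
    coarse = info_nce(video_emb, para_emb)                        # video-paragraph matching
    T, S = frame_emb.size(0), sent_emb.size(0)
    pseudo = torch.clamp(torch.arange(T) * S // T, max=S - 1)     # frame -> sentence index by order
    logits = F.normalize(frame_emb, dim=-1) @ F.normalize(sent_emb, dim=-1).t() / tau
    fine = F.cross_entropy(logits, pseudo.to(frame_emb.device))   # pseudo frame-sentence matching
    return coarse + fine
```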
https://arxiv.org/abs/2303.12370
The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: this https URL.
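For reference, the reference-free scoring recipe that contrastive captioning metrics of this kind boil down to can be sketched as follows. This follows the generic CLIP-Score formulation (scaled, clipped cosine similarity) with placeholder embeddings rather than the released PAC-S backbone, so the scaling constant and encoders are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_score(image_emb, caption_emb, scale=2.5):
    """Reference-free score: scaled, non-negative cosine similarity in a joint space."""
    sim = F.cosine_similarity(image_emb, caption_emb, dim=-1)
    return scale * torch.clamp(sim, min=0.0)

image_emb = torch.randn(4, 512)    # embeddings of 4 images (placeholder encoder outputs)
caption_emb = torch.randn(4, 512)  # embeddings of their candidate captions
print(clip_style_score(image_emb, caption_emb))   # one score per image-caption pair
```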
https://arxiv.org/abs/2303.12112
The success of deep learning models has led to their adaptation and adoption by prominent video understanding methods. The majority of these approaches encode features in a joint space-time modality for which the inner workings and learned representations are difficult to visually interpret. We propose LEArned Preconscious Synthesis (LEAPS), an architecture-agnostic method for synthesizing videos from the internal spatiotemporal representations of models. Using a stimulus video and a target class, we prime a fixed space-time model and iteratively optimize a video initialized with random noise. We incorporate additional regularizers to improve the feature diversity of the synthesized videos as well as the cross-frame temporal coherence of motions. We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of spatiotemporal convolutional and attention-based architectures trained on Kinetics-400, which to the best of our knowledge has not been previously accomplished.
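The synthesis loop itself is a standard model-inversion recipe; the sketch below is an illustrative assumption of how it might look (not the authors' implementation), with a frozen space-time classifier `model` mapping a (1, C, T, H, W) clip to logits, a stimulus clip used here only to fix the input shape, and generic total-variation and temporal-coherence regularizers.

```python
import torch
import torch.nn.functional as F

def synthesize(model, target_class, stimulus, steps=2000, lr=0.1, w_tv=1e-4, w_temp=1e-3):
    model.eval()
    video = torch.randn_like(stimulus, requires_grad=True)        # (1, C, T, H, W), noise init
    opt = torch.optim.Adam([video], lr=lr)
    for _ in range(steps):
        logits = model(video)
        target = torch.tensor([target_class], device=logits.device)
        cls_loss = F.cross_entropy(logits, target)                 # push towards the target class
        tv = (video[..., 1:, :] - video[..., :-1, :]).abs().mean() + \
             (video[..., :, 1:] - video[..., :, :-1]).abs().mean()  # spatial smoothness
        temp = (video[:, :, 1:] - video[:, :, :-1]).pow(2).mean()   # cross-frame coherence
        loss = cls_loss + w_tv * tv + w_temp * temp
        opt.zero_grad()
        loss.backward()
        opt.step()
    return video.detach()
```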
https://arxiv.org/abs/2303.09941
In this paper, we efficiently transfer the surpassing representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters. Previous adaptation methods have simultaneously considered spatial and temporal modeling with a unified learnable module but still fell short of fully leveraging the representative capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. Especially for temporal dynamic modeling, we incorporate consecutive frames into a grid-like frameset to precisely imitate vision transformers' capability to extrapolate relationships between tokens. In addition, we extensively investigate multiple baselines from a unified perspective in video understanding and compare them with DualPath. Experimental results on four action recognition benchmarks prove that pretrained image transformers with DualPath can be effectively generalized beyond the data domain.
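The grid-like frameset can be sketched as a simple tensor reshape: consecutive frames are tiled into one large image so a frozen image transformer treats them as ordinary spatial tokens. The grid size, frame count, and resolution below are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def frames_to_grid(frames, grid=(2, 2)):
    """frames: (B, T, C, H, W) with T == grid[0] * grid[1] -> (B, C, H * gh, W * gw)."""
    B, T, C, H, W = frames.shape
    gh, gw = grid
    assert T == gh * gw, "number of frames must fill the grid"
    x = frames.view(B, gh, gw, C, H, W)
    x = x.permute(0, 3, 1, 4, 2, 5).contiguous()    # -> (B, C, gh, H, gw, W)
    return x.view(B, C, gh * H, gw * W)

clip = torch.randn(8, 4, 3, 112, 112)               # 4 consecutive 112x112 frames per sample
grid_images = frames_to_grid(clip)                  # (8, 3, 224, 224): one grid image per sample
```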
https://arxiv.org/abs/2303.09857
Temporal Action Localization (TAL) is a challenging task in video understanding that aims to identify and localize actions within a video sequence. Recent studies have emphasized the importance of applying long-term temporal context modeling (TCM) blocks, such as complex self-attention mechanisms, to the extracted video clip features. In this paper, we present the simplest method ever to address this task and argue that the extracted video clip features are already informative enough to achieve outstanding performance without sophisticated architectures. To this end, we introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features with a basic, parameter-free max-pooling block that operates on local regions. By picking out only the most critical information from adjacent, local clip embeddings, this block yields a more efficient TAL model. We demonstrate that TemporalMaxer outperforms state-of-the-art methods that utilize long-term TCM, such as self-attention, on various TAL datasets while requiring significantly fewer parameters and computational resources. The code for our approach is publicly available at this https URL
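Because the block is parameter-free, it reduces to a single pooling call; a minimal sketch (kernel and stride values are illustrative, not the released configuration) could look like this:

```python
import torch
import torch.nn as nn

class TemporalMaxBlock(nn.Module):
    """Keeps only the locally dominant clip feature within a small temporal window."""
    def __init__(self, kernel_size=3, stride=2):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size, stride=stride, padding=kernel_size // 2)

    def forward(self, x):                    # x: (B, C, T) extracted clip features
        return self.pool(x)

feats = torch.randn(2, 512, 256)             # a sequence of 256 clip embeddings
print(TemporalMaxBlock()(feats).shape)       # -> torch.Size([2, 512, 128])
```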
https://arxiv.org/abs/2303.09055
Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about the progression of events, causality, and even the function of certain objects within a scene. To address this limitation, we propose a novel Transformer-based video captioning model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge. We show that these forms of knowledge, in isolation and in combination, enhance the quality of the produced captions. Further, inspired by imitation learning, we propose a new task of instruction generation, where the goal is to produce a set of linguistic instructions from a video demonstration of its performance. We formalize the task using the ALFRED dataset [52], generated using the AI2-THOR environment. While instruction generation is conceptually similar to paragraph captioning, it differs in that it exhibits stronger object persistence, as well as spatially-aware and causal sentence structure. We show that our commonsense-knowledge-enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as the state-of-the-art result on more traditional video captioning on the ActivityNet Captions dataset [29].
https://arxiv.org/abs/2303.07545
Multimodal processing has attracted much attention lately, especially with the success of pre-training. However, the exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities in addition to the inner characteristics of the audio modality. Moreover, we further design an audio type token to dynamically learn different audio information types for different scenarios, as both verbal and nonverbal heterogeneous information is conveyed in general audio. Our proposed CLIP4VLA model is validated on different downstream tasks, including video retrieval and video captioning, and achieves state-of-the-art performance on the benchmark datasets MSR-VTT, VATEX, and AudioCaps.
https://arxiv.org/abs/2303.06591
Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. To relax the dependency on labeled data of downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder to learn to generate text by reconstructing the input text given its coordinate in shared latent space. Consequently, during inference, based on the data-to-text pipeline, ZeroNLG can generate target sentences across different languages given the coordinate of input data in the common space. Within this unified framework, given visual (imaging or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods.
https://arxiv.org/abs/2303.06458
Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To break away from this constraint, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively mines the alignments between multi-sentence descriptions and the corresponding event segments. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments, i.e., text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost to mitigate the sub-optimal matching results caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge.
https://arxiv.org/abs/2303.06378
Video captioning aims to describe events in a video with natural language. In recent years, many works have focused on improving captioning models' performance. However, like other text generation tasks, it risks introducing factual errors not supported by the input video. These factual errors can seriously affect the quality of the generated text, sometimes making it completely unusable. Although factual consistency has received much research attention in text-to-text tasks (e.g., summarization), it is less studied in the context of vision-based text generation. In this work, we conduct a detailed human evaluation of the factuality in video captioning and collect two annotated factuality datasets. We find that 57.0% of the model-generated sentences have factual errors, indicating it is a severe problem in this field. However, existing evaluation metrics are mainly based on n-gram matching and show little correlation with human factuality annotation. We further propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning. The datasets and metrics will be released to promote future research for video captioning.
https://arxiv.org/abs/2303.02961
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries, and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the video paragraph captioning task and the standard task of video clip captioning. Our code and models will be publicly released at this https URL.
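The pseudo-labelling step can be pictured with a short sketch: each transcribed sentence contributes a (start, end, text) triple, and its boundaries are serialized as special time tokens inside one target sequence. The token format and the 100-bin quantization below are illustrative assumptions, not necessarily the exact tokenization used by Vid2Seq.

```python
def to_time_token(t_sec, duration, num_bins=100):
    """Quantize a timestamp into one of `num_bins` relative time tokens."""
    return f"<time={min(int(t_sec / duration * num_bins), num_bins - 1)}>"

def build_target_sequence(asr_sentences, duration):
    """asr_sentences: list of (start_sec, end_sec, text) from transcribed speech."""
    parts = []
    for start, end, text in asr_sentences:
        parts.append(f"{to_time_token(start, duration)} {to_time_token(end, duration)} {text}")
    return " ".join(parts)

asr = [(0.0, 4.2, "crack two eggs into a bowl"), (4.2, 9.8, "whisk until smooth")]
print(build_target_sequence(asr, duration=60.0))
# <time=0> <time=7> crack two eggs into a bowl <time=7> <time=16> whisk until smooth
```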
https://arxiv.org/abs/2302.14115
To address the problem of medical image recognition, computer vision techniques like convolutional neural networks (CNNs) are frequently used. Recently, 3D CNN-based models have dominated the field of magnetic resonance imaging (MRI) analytics. Due to the high similarity between MRI data and videos, we conduct extensive empirical studies on video recognition techniques for MRI classification to answer the following questions: (1) can we directly use video recognition models for MRI classification, (2) which model is more appropriate for MRI, and (3) are common tricks from video recognition, such as data augmentation, still useful for MRI classification? Our work suggests that advanced video techniques benefit MRI classification. In this paper, four datasets for Alzheimer's and Parkinson's disease recognition are utilized in experiments, together with three alternative video recognition models and data augmentation techniques that are frequently applied to video tasks. The results reveal that the video framework performs better than 3D-CNN models by 5% - 11% while using 50% - 66% fewer trainable parameters. This report pushes forward the potential fusion of 3D medical imaging and video understanding research.
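The basic data plumbing behind such a study is simple enough to sketch: treat the slice axis of an MRI volume as the temporal axis so an off-the-shelf video model can consume it. The normalization and channel handling below are illustrative assumptions, not the paper's exact preprocessing.

```python
import torch

def mri_volume_to_clip(volume):
    """volume: (D, H, W) float tensor -> (1, 3, T=D, H, W) video-style clip."""
    v = (volume - volume.mean()) / (volume.std() + 1e-6)   # simple intensity normalization
    clip = v.unsqueeze(0).repeat(3, 1, 1, 1)               # grayscale slices -> 3 channels
    return clip.unsqueeze(0)                               # add batch dimension

vol = torch.randn(64, 224, 224)                            # 64 axial slices
print(mri_volume_to_clip(vol).shape)                       # torch.Size([1, 3, 64, 224, 224])
```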
https://arxiv.org/abs/2302.12688
Although large-scale video-language pre-training models, which usually build a global alignment between the video and the text, have achieved remarkable progress on various downstream tasks, the idea of adopting fine-grained information during the pre-training stage is not well explored. In this work, we propose STOA-VLP, a pre-training framework that jointly models object and action information across spatial and temporal dimensions. More specifically, the model regards object trajectories across frames and multiple action features from the video as fine-grained features. Besides, we design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model. The first is a dynamic object-text alignment task, which builds a better connection between object trajectories and the relevant noun tokens. The second is a spatial-temporal action set prediction task, which guides the model to generate consistent action features by predicting actions found in the text. Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of the proposed STOA-VLP (e.g., a 3.7 ROUGE-L improvement on the MSR-VTT video captioning benchmark and a 2.9% accuracy improvement on the MSVD video question answering benchmark compared to previous approaches).
https://arxiv.org/abs/2302.09736
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or video-query pair where query is an image region or sentence) and outputs (temporal segments or spatio-temporal tubes). However, at their core they require the same fundamental understanding of the video, i.e., the actors and objects in it, their actions and interactions. So far these tasks have been tackled in isolation with individual, highly specialized architectures, which do not exploit the interplay between tasks. In contrast, in this paper, we present a single, unified model for tackling query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark which entail queries of three different forms: given an egocentric video and a visual, textual or activity query, the goal is to determine when and where the answer can be seen within the video. Our model design is inspired by recent query-based approaches to spatio-temporal grounding, and contains modality-specific query encoders and task-specific sliding window inference that allow multi-task training with diverse input modalities and different structured outputs. We exhaustively analyze relationships among the tasks and illustrate that cross-task learning leads to improved performance on each individual task, as well as the ability to generalize to unseen tasks, such as zero-shot spatial localization of language queries.
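The task-agnostic part of such a design, sliding-window inference against an encoded query, can be sketched as below; the window and stride values and the scoring callable are assumptions for illustration, not the model's actual configuration.

```python
import torch

def sliding_window_inference(score_fn, video_feats, query_emb, window=128, stride=64):
    """video_feats: (T, D) frame features; score_fn(window_feats, query_emb) -> scalar relevance."""
    T = video_feats.size(0)
    best_span, best_score = None, float("-inf")
    for s in range(0, max(T - window, 0) + 1, stride):
        w = video_feats[s:s + window]                      # one temporal window
        score = float(score_fn(w, query_emb))
        if score > best_score:
            best_span, best_score = (s, min(s + window, T)), score
    return best_span, best_score                           # (start_frame, end_frame), relevance
```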
https://arxiv.org/abs/2302.08063
Recent vision-transformer-based video models mostly follow the "image pre-training then fine-tuning" paradigm and have achieved great success on multiple video benchmarks. However, fully fine-tuning such a video model could be computationally expensive and unnecessary, given that pre-trained image transformer models have demonstrated exceptional transferability. In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation, and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. We show that the proposed AIM can achieve competitive or even better performance than prior arts with substantially fewer tunable parameters on four video action recognition benchmarks. Thanks to its simplicity, our method is also generally applicable to different image pre-trained models, which has the potential to leverage more powerful image foundation models in the future. The project webpage is at this https URL.
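The core mechanics are easy to sketch: a residual bottleneck adapter plus a freezing rule that leaves only the adapter weights trainable. Where the adapter sits inside each block (spatial, temporal, or joint adaptation) varies in the paper; the snippet below is a generic illustration, not the released implementation, and the bottleneck width is an assumption.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight residual bottleneck inserted into a frozen transformer block."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))          # residual bottleneck

def freeze_backbone_except_adapters(model):
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name                  # train adapters only
```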
https://arxiv.org/abs/2302.03024
Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement. It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released in this https URL.
https://arxiv.org/abs/2302.00402
Efficient video-language modeling should consider the computational cost because of a large, sometimes intractable, number of video frames. Parametric approaches such as the attention mechanism may not be ideal since its computational cost quadratically increases as the video length increases. Rather, previous studies have relied on offline feature extraction or frame sampling to represent the video efficiently, focusing on cross-modal modeling in short video clips. In this paper, we propose a semi-parametric video-grounded text generation model, SeViT, a novel perspective on scalable video-language modeling toward long untrimmed videos. Treating a video as an external data store, SeViT includes a non-parametric frame retriever to select a few query-relevant frames from the data store for a given query and a parametric generator to effectively aggregate the frames with the query via late fusion methods. Experimental results demonstrate our method has a significant advantage in longer videos and causal video understanding. Moreover, our model achieves the new state of the art on four video-language datasets, iVQA (+4.8), Next-QA (+6.9), and Activitynet-QA (+4.8) in accuracy, and MSRVTT-Caption (+3.6) in CIDEr.
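The non-parametric retrieval step can be sketched in a few lines: the video's precomputed frame embeddings act as the external data store, and only the top-k frames most similar to the query embedding are passed on to the generator for late fusion. The embedding dimensionality and k below are assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve_frames(frame_embs, query_emb, k=8):
    """frame_embs: (T, D) precomputed frame features; query_emb: (D,)."""
    sims = F.normalize(frame_embs, dim=-1) @ F.normalize(query_emb, dim=-1)
    top = sims.topk(min(k, frame_embs.size(0))).indices.sort().values  # keep temporal order
    return top, frame_embs[top]

frames = torch.randn(1200, 512)      # a long, untrimmed video as an external data store
query = torch.randn(512)             # embedding of the question or caption prompt
idx, selected = retrieve_frames(frames, query)
print(idx.shape, selected.shape)     # torch.Size([8]) torch.Size([8, 512])
```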
https://arxiv.org/abs/2301.11507
Video-Language Pre-training models have recently significantly improved various multi-modal downstream tasks. Previous dominant works mainly adopt contrastive learning to achieve global feature alignment across modalities. However, the local associations between videos and texts are not modeled, restricting the pre-training models' generality, especially for tasks requiring the temporal video boundary for certain query texts. This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment such that the trained model can accurately perceive temporal boundaries in videos given the text description. Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description, and text localization which matches the subset of texts with the video features. To produce temporal boundaries, frame features in several videos are manually merged into a long video sequence that interacts with a text sequence. With the localization task, our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality. Notably, comprehensive experimental results show that our method significantly improves the state-of-the-art performance on various benchmarks, covering text-to-video retrieval, video question answering, video captioning, temporal action localization and temporal moment retrieval. The code will be released soon.
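One plausible reading of how the moment-retrieval targets are constructed, inferred from the abstract rather than from released code: frame features from several videos are concatenated into one long sequence, and each text's ground-truth moment is simply the span its source video occupies in that sequence.

```python
import torch

def merge_videos(feature_list):
    """feature_list: list of (T_i, D) frame features -> merged (sum T_i, D) and per-video spans."""
    spans, start = [], 0
    for feats in feature_list:
        spans.append((start, start + feats.size(0)))       # [start, end) span for this video's text
        start += feats.size(0)
    return torch.cat(feature_list, dim=0), spans

videos = [torch.randn(30, 256), torch.randn(45, 256), torch.randn(20, 256)]
merged, spans = merge_videos(videos)
print(merged.shape, spans)   # torch.Size([95, 256]) [(0, 30), (30, 75), (75, 95)]
```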
https://arxiv.org/abs/2301.07463