Video corpus moment retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a natural language query. The relevance between the video and the query is partial, mainly evident in two aspects: (1) Scope: The untrimmed video contains information-rich frames, and not all are relevant to the query. Strong correlation is typically observed only within the relevant moment, emphasizing the importance of capturing key content. (2) Modality: The relevance of the query to different modalities varies; action descriptions align more with the visual elements, while character conversations are more related to textual information. Recognizing and addressing these modality-specific nuances is crucial for effective retrieval in VCMR. However, existing methods often treat all video contents equally, leading to sub-optimal moment retrieval. We argue that effectively capturing the partial relevance between the query and the video is essential for the VCMR task. To this end, we propose a Partial Relevance Enhanced Model (PREM) to improve VCMR. VCMR involves two sub-tasks: video retrieval and moment localization. To align with their distinct objectives, we implement specialized partial relevance enhancement strategies. For video retrieval, we introduce a multi-modal collaborative video retriever, generating distinct query representations tailored for different modalities by modality-specific pooling, ensuring a more effective match. For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content, followed by fusing multi-modal information for moment localization. Experimental results on the TVR and DiDeMo datasets show that the proposed model outperforms the baselines, achieving a new state of the art on VCMR.
https://arxiv.org/abs/2402.13576
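A rough sketch of what modality-specific pooling for the retriever could look like: the same query tokens are attention-pooled with a separate head per modality, yielding one query vector to match visual features and another to match subtitle features. Module names, dimensions, and the choice of attention pooling are illustrative assumptions, not PREM's actual implementation.

```python
import torch
import torch.nn as nn

class ModalitySpecificPooling(nn.Module):
    """Pool query token embeddings differently for each video modality.

    Illustrative sketch: one attention-pooling head per modality produces a
    modality-tailored query vector from the same token sequence.
    """
    def __init__(self, dim: int, modalities=("visual", "subtitle")):
        super().__init__()
        self.scorers = nn.ModuleDict({m: nn.Linear(dim, 1) for m in modalities})

    def forward(self, query_tokens: torch.Tensor, mask: torch.Tensor) -> dict:
        # query_tokens: (batch, num_tokens, dim); mask: (batch, num_tokens), 1 = real token
        pooled = {}
        for name, scorer in self.scorers.items():
            logits = scorer(query_tokens).squeeze(-1)           # (batch, num_tokens)
            logits = logits.masked_fill(mask == 0, float("-inf"))
            weights = torch.softmax(logits, dim=-1).unsqueeze(-1)
            pooled[name] = (weights * query_tokens).sum(dim=1)  # (batch, dim)
        return pooled

# The "visual" vector is matched against visual features, the "subtitle" one against subtitles.
pooler = ModalitySpecificPooling(dim=256)
queries = pooler(torch.randn(2, 12, 256), torch.ones(2, 12))
```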
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query. Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos based on maximum frame similarity. However, this approach overlooks the semantic structure embedded within the information between frames, namely, the event, a crucial element for human comprehension of videos. Motivated by this, we propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval. The model extracts event representations through event reasoning and hierarchical event encoding. The event reasoning module groups consecutive and visually similar frame representations into events, while the hierarchical event encoding encodes information at both the frame and event levels. We also introduce anchor multi-head self-attention to encourage the Transformer to capture the relevance of adjacent content in the video. EventFormer is trained with two-branch contrastive learning and dual optimization for the two sub-tasks of VCMR. Extensive experiments on the TVR, ANetCaps, and DiDeMo benchmarks show the effectiveness and efficiency of EventFormer in VCMR, achieving new state-of-the-art results. Additionally, the effectiveness of EventFormer is also validated on the partially relevant video retrieval task.
https://arxiv.org/abs/2402.13566
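The event reasoning step can be approximated as grouping consecutive frames whose representations stay above a cosine-similarity threshold and mean-pooling each group into an event representation; the threshold and the pooling choice below are assumptions for illustration, not EventFormer's exact rule.

```python
import torch
import torch.nn.functional as F

def group_frames_into_events(frames: torch.Tensor, sim_threshold: float = 0.8):
    """Group consecutive, visually similar frame features into events.

    frames: (num_frames, dim) frame representations.
    Returns a (num_events, dim) tensor of mean-pooled event representations.
    Sketch only; the grouping rule in EventFormer may differ.
    """
    normed = F.normalize(frames, dim=-1)
    events, current = [], [frames[0]]
    for i in range(1, frames.size(0)):
        # Adjacent-frame cosine similarity decides whether frame i joins the current event.
        if torch.dot(normed[i - 1], normed[i]).item() >= sim_threshold:
            current.append(frames[i])
        else:
            events.append(torch.stack(current).mean(dim=0))
            current = [frames[i]]
    events.append(torch.stack(current).mean(dim=0))
    return torch.stack(events)

event_reps = group_frames_into_events(torch.randn(100, 512))
```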
Although pre-trained vision-language models have demonstrated significant benefits in boosting video-text retrieval performance by learning from large-scale web videos, fine-tuning on clips manually annotated with start and end times still plays a critical role, and such annotation requires considerable human effort. To address this issue, we explore an alternative, cheaper source of annotations, single timestamps, for video-text retrieval. We initialise clips from timestamps in a heuristic way to warm up a retrieval model. Then a video clip editing method is proposed to refine the initial rough boundaries and improve retrieval performance. A student-teacher network is introduced for video clip editing. The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips. The teacher's weights are updated from the student's once the student's performance improves. Our method is model-agnostic and applicable to any retrieval model. We conduct experiments based on three state-of-the-art retrieval models, COOT, VideoCLIP and CLIP4Clip. Experiments conducted on three video retrieval datasets, YouCook2, DiDeMo and ActivityNet-Captions, show that our edited clips consistently improve retrieval performance over the initial clips across all three retrieval models.
https://arxiv.org/abs/2402.02335
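One plausible reading of the heuristic initialization: expand a fixed-length window around each annotated timestamp, clamped to the video duration, and use it as the initial clip that the teacher later edits. The window length is an assumed hyperparameter, not the paper's value.

```python
def init_clip_from_timestamp(timestamp: float, video_duration: float,
                             window: float = 10.0):
    """Heuristically initialize a clip around a single annotated timestamp.

    Sketch under assumptions: a symmetric window of `window` seconds,
    clamped to the video boundaries. The paper's exact heuristic may differ.
    """
    start = max(0.0, timestamp - window / 2)
    end = min(video_duration, timestamp + window / 2)
    return start, end

# e.g. a caption stamped at 42.0 s in a 300 s video -> initial clip (37.0, 47.0)
print(init_clip_from_timestamp(42.0, 300.0))
```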
Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitively high computational cost of modeling long videos. To address this issue, one feasible solution is learning the correspondence between video clips and captions, which however inevitably encounters the multi-granularity noisy correspondence (MNC) problem. To be specific, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), hindering temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address the fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits potential faulty negative samples in clip-caption contrast by rectifying the alignment target with the OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at this https URL.
https://arxiv.org/abs/2401.16702
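A hedged sketch of the optimal-transport machinery: entropic OT via Sinkhorn iterations over a clip-caption similarity matrix, plus a log-sum-exp "soft-maximum" that aggregates fine-grained similarities so crucial words and key frames dominate. The temperature, iteration count, and uniform marginals are illustrative choices, not Norton's settings.

```python
import torch

def sinkhorn(sim: torch.Tensor, eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    """Entropic optimal transport plan for a clip-caption similarity matrix.

    sim: (num_clips, num_captions) similarities; returns an (approximately)
    doubly normalized transport plan of the same shape.
    """
    K = torch.exp((sim - sim.max()) / eps)         # Gibbs kernel, shifted for stability
    r = torch.full((K.size(0),), 1.0 / K.size(0))  # uniform row marginal
    c = torch.full((K.size(1),), 1.0 / K.size(1))  # uniform column marginal
    v = torch.ones_like(c)
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return torch.diag(u) @ K @ torch.diag(v)

def soft_maximum(frame_word_sim: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Log-sum-exp aggregation so crucial words / key frames dominate the score."""
    return tau * torch.logsumexp(frame_word_sim / tau, dim=-1)

plan = sinkhorn(torch.randn(8, 8))  # soft clip-caption alignment for one video-paragraph pair
```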
There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, mimicking the listening, seeing and reading process of human beings. Humans tend to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper, we introduce CoAVT -- a novel cognition-inspired Correlated Audio-Visual-Text pre-training model to connect the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder to handle textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings, and extracts the audiovisual features most informative for the corresponding text. Additionally, to leverage the correspondences of audio and vision with language respectively, we also establish audio-text and visual-text bi-modal alignments upon the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: contrastive loss, matching loss and language modeling loss. Extensive experiments show that CoAVT can learn strong multimodal correlations and generalize to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps for both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
https://arxiv.org/abs/2401.12264
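The query encoder can be pictured as a set of learnable query embeddings that cross-attend to the joint audio-visual features to pull out the content most relevant to the paired text. The single-layer sketch below, including its dimensions and head count, is an assumption; CoAVT's actual query encoder may be deeper and conditioned differently.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Learnable queries that extract text-relevant audio-visual features.

    Illustrative single-layer version of a query encoder; the real module
    (depth, heads, text conditioning) may differ.
    """
    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audiovisual_feats: torch.Tensor) -> torch.Tensor:
        # audiovisual_feats: (batch, seq_len, dim) from the joint audio-visual encoder
        batch = audiovisual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.cross_attn(q, audiovisual_feats, audiovisual_feats)
        return out  # (batch, num_queries, dim), later matched against text embeddings

query_feats = QueryEncoder()(torch.randn(4, 196, 768))
```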
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level general video information; (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling the video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at this https URL
https://arxiv.org/abs/2401.10588
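The shared latent space can be read as a small set of latent vectors projected by two heads into text-side and frame-side prompts, so both branches are driven by the same parameters. The sketch below only shows that shared-latent idea; prompt counts, dimensions, and the omitted global-local attention are assumptions.

```python
import torch
import torch.nn as nn

class SharedPromptGenerator(nn.Module):
    """Generate text prompts and frame prompts from one shared latent space.

    Sketch only: DGL's actual prompt generation and global-local attention are
    more involved; this just illustrates tying both prompt sets to shared latents.
    """
    def __init__(self, latent_dim=512, num_prompts=4, text_dim=512, visual_dim=768):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(num_prompts, latent_dim) * 0.02)
        self.to_text = nn.Linear(latent_dim, text_dim)
        self.to_frame = nn.Linear(latent_dim, visual_dim)

    def forward(self):
        text_prompts = self.to_text(self.latent)    # prepended to text tokens
        frame_prompts = self.to_frame(self.latent)  # prepended to frame patch tokens
        return text_prompts, frame_prompts

text_p, frame_p = SharedPromptGenerator()()
```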
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
https://arxiv.org/abs/2401.06129
Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared to conventional textual retrieval, the main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. Previous works primarily focus on aligning the query and the video by finely aggregating word-frame matching signals. Inspired by the human cognitive process of modularly judging the relevance between text and video, we observe that such judgment requires high-order matching signals due to the consecutive and complex nature of video content. In this paper, we propose chunk-level text-video matching, where query chunks are extracted to describe a specific retrieval unit, and video chunks are segmented into distinct clips of the video. We formulate chunk-level matching as n-ary correlation modeling between the words of the query and the frames of the video and introduce a multi-modal hypergraph for n-ary correlation modeling. By representing textual units and video frames as nodes and using hyperedges to depict their relationships, a multi-modal hypergraph is constructed. In this way, the query and the video can be aligned in a high-order semantic space. In addition, to enhance the model's generalization ability, the extracted features are fed into a variational inference component to obtain a variational representation under a Gaussian distribution. The incorporation of hypergraphs and variational inference allows our model to capture complex, n-ary interactions among textual and visual contents. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the text-video retrieval task.
https://arxiv.org/abs/2401.03177
We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's using video-and-text conditioned queries. Furthermore, we devise a language-based pipeline that exploits how-to video narration text to create weakly supervised training data. We demonstrate our idea applied to the domain of how-to cooking videos, where a user can detour from their current recipe to find steps with alternate ingredients, tools, and techniques. Validating on a ground truth annotated dataset of 16K samples, we show our model's significant improvements over best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%.
https://arxiv.org/abs/2401.01823
In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, despite the prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage architecture for the retrieval phase. This solution ingeniously balances the coarse and fine granularity of retrieval content, and it also strikes a harmonious equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, it achieves comparable performance with the current state-of-the-art methods while being nearly 50 times faster.
https://arxiv.org/abs/2401.00701
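The retrieval phase reduces to a standard recall-then-rerank pipeline: score every video with cheap coarse-grained embeddings, keep the top-k, and rescore only those candidates with the expensive fine-grained representation. The scoring functions below are placeholders, not the paper's TIB-based scorer.

```python
import numpy as np

def two_stage_retrieval(query_emb, coarse_video_embs, fine_score_fn, k=50):
    """Coarse recall of top-k candidates, then fine-grained reranking.

    query_emb: (dim,) text embedding; coarse_video_embs: (num_videos, dim) array.
    fine_score_fn(query_emb, video_index) -> float stands in for the expensive
    fine-grained (e.g. frame-level, text-gated) scorer.
    """
    coarse_scores = coarse_video_embs @ query_emb              # fast dot-product recall
    candidates = np.argsort(-coarse_scores)[:k]                # top-k by coarse score
    fine_scores = [(int(v), fine_score_fn(query_emb, int(v))) for v in candidates]
    return sorted(fine_scores, key=lambda x: -x[1])            # reranked candidate list

# Usage: ranking = two_stage_retrieval(q, video_matrix, fine_score_fn=my_scorer, k=50)
```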
Self-supervised approaches for video have shown impressive results in video understanding tasks. However, unlike early works that leverage temporal self-supervision, current state-of-the-art methods primarily rely on tasks from the image domain (e.g., contrastive learning) that do not explicitly promote the learning of temporal features. We identify two factors that limit existing temporal self-supervision: 1) tasks are too simple, resulting in saturated training performance, and 2) we uncover shortcuts based on local appearance statistics that hinder the learning of high-level features. To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts. Our model extends a representation of single video frames, pre-trained through contrastive learning, with a transformer that we train through temporal self-supervision. We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision. The generalization capability of our self-supervised video method is evidenced by its state-of-the-art performance in a wide range of high-level semantic tasks, including video retrieval, action classification, and video attribute recognition (such as object and scene identification), as well as low-level temporal correspondence tasks like video object segmentation and pose tracking. Additionally, we show that the video representations learned through our method exhibit increased robustness to the input perturbations.
https://arxiv.org/abs/2312.13008
A short video clip may contain the progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate the events to understand the story behind them. In this work, we present a new multi-shot video understanding benchmark, Shot2Story20K, with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show the challenges of generating a long and comprehensive video summary. Nevertheless, the generated imperfect summaries can already significantly boost the performance of existing video understanding tasks such as video question-answering, promoting an under-explored setting of video understanding with detailed summaries.
https://arxiv.org/abs/2312.10300
Visual retrieval aims to search for the most relevant visual items, e.g., images and videos, from a candidate gallery with a given query item. Accuracy and efficiency are two competing objectives in retrieval tasks. Instead of crafting a new method pursuing further improvement on accuracy, in this paper we propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval. Furthermore, we discover that the similarities obtained by different retrieval models are diversified and incommensurable, which makes it challenging to jointly distill knowledge from multiple models. Therefore, we propose to whiten the output of teacher models before fusion, which enables effective multi-teacher distillation for retrieval models. Whiten-MTD is conceptually simple and practically effective. Extensive experiments on two landmark image retrieval datasets and one video retrieval dataset demonstrate the effectiveness of our proposed method, and its good balance of retrieval performance and efficiency. Our source code is released at this https URL.
https://arxiv.org/abs/2312.09716
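The core trick is to standardize each teacher's similarity scores before fusing them, so that scores from different retrieval models become commensurable. Below is a per-teacher z-score whitening followed by averaging; the paper's exact whitening transform and fusion may differ.

```python
import numpy as np

def whiten_and_fuse(teacher_scores):
    """Whiten each teacher's similarity scores, then fuse by averaging.

    teacher_scores: list of (num_candidates,) arrays, one per teacher model.
    Whitening (zero mean, unit variance here) makes the otherwise
    incommensurable score distributions comparable before fusion.
    """
    whitened = []
    for scores in teacher_scores:
        scores = np.asarray(scores, dtype=np.float64)
        whitened.append((scores - scores.mean()) / (scores.std() + 1e-8))
    return np.mean(whitened, axis=0)  # fused distillation target for the student

fused = whiten_and_fuse([np.random.rand(100), 10 * np.random.rand(100) - 3])
```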
Text-video retrieval, a prominent sub-field within the broader domain of multimedia content management, has witnessed remarkable growth and innovation over the past decade. However, existing methods assume the video scenes are consistent and the description annotators are unbiased. These assumptions fail to align with fluid real-world scenarios, where descriptions can be influenced by annotator biases, diverse writing styles, and varying textual perspectives. To overcome the aforementioned problems, we introduce WAVER, a cross-domain knowledge distillation mechanism designed to tackle the challenge of handling variations in writing style. WAVER capitalizes on the open-vocabulary properties inherent in pre-trained vision-language models and employs an implicit knowledge distillation approach to transfer text-based knowledge from a teacher model to a vision-based student. Empirical studies conducted across four standard benchmark datasets, encompassing various settings, provide compelling evidence that WAVER can achieve state-of-the-art performance in text-video retrieval tasks while handling writing-style variations.
https://arxiv.org/abs/2312.09507
A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks. However, recent works have shown that the current models do not achieve a comprehensive understanding of the textual data during the training for the target downstream tasks. Orthogonal to the previous approaches to this limitation, we postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models. Hence, we utilize the knowledge of a pre-trained large language model (LLM) to generate text samples from the original ones, targeting specific sentence components. We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks. Through rigorous quantitative analysis, our proposed method exhibits significant improvement across several video-language tasks. In particular, our approach notably enhances video-text retrieval by a relative improvement of 8.3% in video-to-text and 1.4% in text-to-video retrieval over the baselines, in terms of R@1. Additionally, in video moment retrieval, average mAP shows a relative improvement ranging from 2.0% to 13.7% across different baselines.
https://arxiv.org/abs/2312.06699
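One hedged way to picture the importance estimation: for each sentence component, compare the video-text matching score of the original caption against an LLM-generated variant with that component altered, and treat the score drop as a weak importance signal. This proxy, and the placeholder score_fn, are assumptions for illustration rather than the paper's estimation module.

```python
def component_importance(score_fn, video, original_caption, component_variants):
    """Estimate the relative importance of sentence components for a video.

    component_variants: {component_name: caption rewritten by an LLM with that
    component altered or removed}. The importance proxy used here (drop in the
    video-text matching score) is an illustrative assumption.
    """
    base = score_fn(video, original_caption)
    drops = {name: max(0.0, base - score_fn(video, variant))
             for name, variant in component_variants.items()}
    total = sum(drops.values()) or 1.0
    return {name: d / total for name, d in drops.items()}  # normalized importance weights
```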
In this paper, we propose an efficient and high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one moment relevant to the input text query. In terms of both efficiency and performance, the overlooked bottleneck of previous studies is the visual encoding of dense frames. This guides researchers to choose lightweight visual backbones, yielding sub-optimal retrieval performance due to the limited capability of their learned visual representations. However, it is undesirable to simply replace them with high-performance large-scale vision-and-language models (VLMs) due to their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that with a simple query-image attention trick, VLMs generalize well to super images and efficiently demonstrate promising zero-shot performance against SOTA methods. In addition, we propose a fine-tuning approach that incorporates a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.
https://arxiv.org/abs/2312.00414
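Constructing a super image is just tiling the N x N sampled frames into one grid so that a single VLM image encoding covers N^2 frames. The sketch below assumes equal-sized frames given as a numpy array.

```python
import numpy as np

def make_super_image(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile n*n frames into a single super image in row-major grid order.

    frames: (n*n, H, W, C) uniformly sampled frames. Returns (n*H, n*W, C).
    One VLM forward pass on the result replaces n*n per-frame encodings.
    """
    assert frames.shape[0] == n * n, "expects exactly n*n frames"
    rows = [np.concatenate(list(frames[i * n:(i + 1) * n]), axis=1)  # one grid row
            for i in range(n)]
    return np.concatenate(rows, axis=0)

grid = make_super_image(np.zeros((16, 224, 224, 3), dtype=np.uint8), n=4)
print(grid.shape)  # (896, 896, 3)
```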
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions using a few long video datasets, showing that they struggle with the transformed data, especially the shortest captions. We also propose a lightweight fine-tuning method, where we use a contrastive loss to learn a hierarchical embedding based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as on the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at this https URL.
https://arxiv.org/abs/2312.00115
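A minimal reading of the fine-tuning objective: an InfoNCE contrastive loss applied per caption granularity (short summary, medium description, full paragraph), pulling the video embedding toward all of its valid descriptions. The flat averaging over levels below is a simplification of the paper's hierarchical embedding objective.

```python
import torch
import torch.nn.functional as F

def multi_granularity_contrastive_loss(video_emb, caption_embs_by_level, tau=0.05):
    """InfoNCE summed over caption granularities (short, medium, paragraph).

    video_emb: (batch, dim); caption_embs_by_level: list of (batch, dim) caption
    embeddings, one per description length, aligned index-wise with the videos.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    loss = 0.0
    for cap_emb in caption_embs_by_level:
        cap_emb = F.normalize(cap_emb, dim=-1)
        logits = video_emb @ cap_emb.t() / tau                 # (batch, batch) similarities
        targets = torch.arange(video_emb.size(0), device=video_emb.device)
        loss = loss + 0.5 * (F.cross_entropy(logits, targets) +
                             F.cross_entropy(logits.t(), targets))
    return loss / len(caption_embs_by_level)
```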
Learning from videos is an emerging research area that enables robots to acquire skills from human demonstrations, such as procedural videos. To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize these understandings to novel domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) intra-video retrieval over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to make use of: (1) out-of-domain visual information; (2) a high temporal context window; and (3) multimodal (text + video) domains. This departs from existing benchmarks for procedural video understanding, which typically deal with short context lengths and can be solved with a single modality. Spacewalk-18, with its inherent multimodal and long-form complexity, exposes the high difficulty of task recognition and segmentation. We find that state-of-the-art methods perform poorly on our benchmark, demonstrating that the goal of generalizable procedural video understanding models remains far off and underscoring the need to develop new approaches to these tasks. Data, model, and code will be publicly released.
https://arxiv.org/abs/2311.18773
To build scalable models for challenging real-world tasks, it is important to learn from diverse, multi-modal data in various forms (e.g., videos, text, and images). Among existing works, a plethora have focused on leveraging large but cumbersome cross-modal architectures. Despite their effectiveness, larger architectures unavoidably prevent the models from being extended to real-world applications, so building a lightweight VL architecture and an efficient learning schema is of great practical value. In this paper, we propose an Efficient Video-Language Model (dubbed E-ViLM) and a masked video modeling (MVM) schema, assisted by a semantic vector-quantized tokenizer. In particular, our E-ViLM learns to reconstruct the semantic labels of masked video regions, produced by the pre-trained vector-quantized tokenizer, which discretizes continuous visual signals into labels. We show that with our simple MVM task and regular VL pre-training modelings, our E-ViLM, despite its compactness, is able to learn expressive representations from a video-language corpus and generalize well to extensive video-language tasks including video question answering, text-to-video retrieval, etc. In particular, our E-ViLM obtains obvious efficiency improvements, reaching competitive performance with faster inference speed: our model reaches 39.3% Top-1 accuracy on the MSRVTT benchmark, retaining 91.4% of the accuracy of the state-of-the-art larger VL architecture with only 15% of the parameters and 94.8% fewer GFLOPs. We also provide extensive ablative studies that validate the effectiveness of our proposed learning schema for E-ViLM.
https://arxiv.org/abs/2311.17267
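The MVM schema can be sketched as follows: a frozen vector-quantized tokenizer assigns a discrete semantic label to every video region, a fraction of regions is masked at the input, and the model is trained with cross-entropy to predict the labels of the masked regions. The interfaces, shapes, and mask ratio below are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_video_modeling_loss(model, tokenizer, video_patches, mask_ratio=0.4):
    """Cross-entropy on masked regions whose targets come from a VQ tokenizer.

    video_patches: (batch, num_regions, dim). `tokenizer(video_patches)` is
    assumed to return discrete labels of shape (batch, num_regions); `model`
    is assumed to return per-region logits over the tokenizer vocabulary.
    """
    with torch.no_grad():
        labels = tokenizer(video_patches)                # (batch, num_regions)
    mask = torch.rand(labels.shape, device=labels.device) < mask_ratio
    logits = model(video_patches, mask)                  # (batch, num_regions, vocab)
    return F.cross_entropy(logits[mask], labels[mask])   # loss only on masked regions
```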
Large pre-trained vision models achieve impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively computationally expensive. Recent studies turn their focus towards efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods lack attention to training memory usage and to the exploration of transferring a larger model to the video domain. In this paper, we present a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding, named Side4Video. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to reduce memory usage by 75% compared to previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B) to video understanding tasks, a model 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), especially on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at this https URL.
https://arxiv.org/abs/2311.15769
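The memory saving comes from never backpropagating through the frozen image backbone: intermediate features are taken under torch.no_grad and fed to a small trainable side path. The sketch below abstracts the backbone as a list of frozen blocks and omits the temporal modules of the real side network; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SideNetwork(nn.Module):
    """Lightweight trainable side path fed by a frozen image backbone.

    `frozen_blocks` is assumed to be a list of already-frozen transformer blocks
    on the right device; gradients flow only through the small side modules.
    """
    def __init__(self, frozen_blocks, feat_dim=1024, side_dim=128):
        super().__init__()
        self.frozen_blocks = frozen_blocks
        self.down = nn.ModuleList(nn.Linear(feat_dim, side_dim) for _ in frozen_blocks)
        self.side_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(side_dim, nhead=4, batch_first=True)
            for _ in frozen_blocks)

    def forward(self, tokens):
        side = 0.0
        for blk, down, side_blk in zip(self.frozen_blocks, self.down, self.side_blocks):
            with torch.no_grad():                    # no backprop through the heavy backbone
                tokens = blk(tokens)
            side = side_blk(side + down(tokens))     # fuse multi-level spatial features
        return side                                  # (batch, seq, side_dim) video representation
```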