Video corpus moment retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a natural language query. The relevance between the video and the query is partial, mainly evident in two aspects: (1) Scope: The untrimmed video contains information-rich frames, and not all are relevant to the query. Strong correlation is typically observed only within the relevant moment, emphasizing the importance of capturing key content. (2) Modality: The relevance of the query to different modalities varies; action descriptions align more with the visual elements, while character conversations are more related to textual information. Recognizing and addressing these modality-specific nuances is crucial for effective retrieval in VCMR. However, existing methods often treat all video contents equally, leading to sub-optimal moment retrieval. We argue that effectively capturing the partial relevance between the query and the video is essential for the VCMR task. To this end, we propose a Partial Relevance Enhanced Model (PREM) to improve VCMR. VCMR involves two sub-tasks: video retrieval and moment localization. To align with their distinct objectives, we implement specialized partial relevance enhancement strategies. For video retrieval, we introduce a multi-modal collaborative video retriever, generating distinct query representations tailored for different modalities by modality-specific pooling, ensuring a more effective match. For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content, followed by fusing multi-modal information for moment localization. Experimental results on the TVR and DiDeMo datasets show that the proposed model outperforms the baselines, achieving a new state of the art on VCMR.
https://arxiv.org/abs/2402.13576
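A rough sketch of what modality-specific pooling for the retriever could look like: the same query tokens are attention-pooled with a separate head per modality, yielding one query vector to match visual features and another to match subtitle features. Module names, dimensions, and the choice of attention pooling are illustrative assumptions, not PREM's actual implementation.

```python
import torch
import torch.nn as nn

class ModalitySpecificPooling(nn.Module):
    """Pool query token embeddings differently for each video modality.

    Illustrative sketch: one attention-pooling head per modality produces a
    modality-tailored query vector from the same token sequence.
    """
    def __init__(self, dim: int, modalities=("visual", "subtitle")):
        super().__init__()
        self.scorers = nn.ModuleDict({m: nn.Linear(dim, 1) for m in modalities})

    def forward(self, query_tokens: torch.Tensor, mask: torch.Tensor) -> dict:
        # query_tokens: (batch, num_tokens, dim); mask: (batch, num_tokens), 1 = real token
        pooled = {}
        for name, scorer in self.scorers.items():
            logits = scorer(query_tokens).squeeze(-1)           # (batch, num_tokens)
            logits = logits.masked_fill(mask == 0, float("-inf"))
            weights = torch.softmax(logits, dim=-1).unsqueeze(-1)
            pooled[name] = (weights * query_tokens).sum(dim=1)  # (batch, dim)
        return pooled

# The "visual" vector is matched against visual features, the "subtitle" one against subtitles.
pooler = ModalitySpecificPooling(dim=256)
queries = pooler(torch.randn(2, 12, 256), torch.ones(2, 12))
```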
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query. Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos based on maximum frame similarity. However, this approach overlooks the semantic structure embedded within the information between frames, namely, the event, a crucial element for human comprehension of videos. Motivated by this, we propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval. The model extracts event representations through event reasoning and hierarchical event encoding. The event reasoning module groups consecutive and visually similar frame representations into events, while the hierarchical event encoding encodes information at both the frame and event levels. We also introduce anchor multi-head self-attention to encourage the Transformer to capture the relevance of adjacent content in the video. EventFormer is trained with two-branch contrastive learning and dual optimization for the two sub-tasks of VCMR. Extensive experiments on the TVR, ANetCaps, and DiDeMo benchmarks show the effectiveness and efficiency of EventFormer in VCMR, achieving new state-of-the-art results. Additionally, the effectiveness of EventFormer is also validated on the partially relevant video retrieval task.
https://arxiv.org/abs/2402.13566
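The event reasoning step can be approximated as grouping consecutive frames whose representations stay above a cosine-similarity threshold and mean-pooling each group into an event representation; the threshold and the pooling choice below are assumptions for illustration, not EventFormer's exact rule.

```python
import torch
import torch.nn.functional as F

def group_frames_into_events(frames: torch.Tensor, sim_threshold: float = 0.8):
    """Group consecutive, visually similar frame features into events.

    frames: (num_frames, dim) frame representations.
    Returns a (num_events, dim) tensor of mean-pooled event representations.
    Sketch only; the grouping rule in EventFormer may differ.
    """
    normed = F.normalize(frames, dim=-1)
    events, current = [], [frames[0]]
    for i in range(1, frames.size(0)):
        # Adjacent-frame cosine similarity decides whether frame i joins the current event.
        if torch.dot(normed[i - 1], normed[i]).item() >= sim_threshold:
            current.append(frames[i])
        else:
            events.append(torch.stack(current).mean(dim=0))
            current = [frames[i]]
    events.append(torch.stack(current).mean(dim=0))
    return torch.stack(events)

event_reps = group_frames_into_events(torch.randn(100, 512))
```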
Although pre-trained vision-language models have demonstrated significant benefits in boosting video-text retrieval performance by learning from large-scale web videos, fine-tuning on clips manually annotated with start and end times still plays a critical role, and such annotation requires considerable human effort. To address this issue, we explore an alternative, cheaper source of annotations, single timestamps, for video-text retrieval. We initialise clips from timestamps in a heuristic way to warm up a retrieval model. Then a video clip editing method is proposed to refine the initial rough boundaries and improve retrieval performance. A student-teacher network is introduced for video clip editing. The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips. The teacher's weights are updated from the student's once the student's performance improves. Our method is model-agnostic and applicable to any retrieval model. We conduct experiments based on three state-of-the-art retrieval models, COOT, VideoCLIP and CLIP4Clip. Experiments conducted on three video retrieval datasets, YouCook2, DiDeMo and ActivityNet-Captions, show that our edited clips consistently improve retrieval performance over the initial clips across all three retrieval models.
https://arxiv.org/abs/2402.02335
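One plausible reading of the heuristic initialization: expand a fixed-length window around each annotated timestamp, clamped to the video duration, and use it as the initial clip that the teacher later edits. The window length is an assumed hyperparameter, not the paper's value.

```python
def init_clip_from_timestamp(timestamp: float, video_duration: float,
                             window: float = 10.0):
    """Heuristically initialize a clip around a single annotated timestamp.

    Sketch under assumptions: a symmetric window of `window` seconds,
    clamped to the video boundaries. The paper's exact heuristic may differ.
    """
    start = max(0.0, timestamp - window / 2)
    end = min(video_duration, timestamp + window / 2)
    return start, end

# e.g. a caption stamped at 42.0 s in a 300 s video -> initial clip (37.0, 47.0)
print(init_clip_from_timestamp(42.0, 300.0))
```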
Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitively high computational cost of modeling long videos. To address this issue, one feasible solution is learning the correspondence between video clips and captions, which however inevitably encounters the multi-granularity noisy correspondence (MNC) problem. To be specific, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), hindering temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address the fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits potential faulty negative samples in clip-caption contrast by rectifying the alignment target with the OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at this https URL.
https://arxiv.org/abs/2401.16702
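A hedged sketch of the optimal-transport machinery: entropic OT via Sinkhorn iterations over a clip-caption similarity matrix, plus a log-sum-exp "soft-maximum" that aggregates fine-grained similarities so crucial words and key frames dominate. The temperature, iteration count, and uniform marginals are illustrative choices, not Norton's settings.

```python
import torch

def sinkhorn(sim: torch.Tensor, eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    """Entropic optimal transport plan for a clip-caption similarity matrix.

    sim: (num_clips, num_captions) similarities; returns an (approximately)
    doubly normalized transport plan of the same shape.
    """
    K = torch.exp((sim - sim.max()) / eps)         # Gibbs kernel, shifted for stability
    r = torch.full((K.size(0),), 1.0 / K.size(0))  # uniform row marginal
    c = torch.full((K.size(1),), 1.0 / K.size(1))  # uniform column marginal
    v = torch.ones_like(c)
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return torch.diag(u) @ K @ torch.diag(v)

def soft_maximum(frame_word_sim: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Log-sum-exp aggregation so crucial words / key frames dominate the score."""
    return tau * torch.logsumexp(frame_word_sim / tau, dim=-1)

plan = sinkhorn(torch.randn(8, 8))  # soft clip-caption alignment for one video-paragraph pair
```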
There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, mimicking the listening, seeing and reading process of human beings. Humans tend to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper, we introduce CoAVT -- a novel cognition-inspired Correlated Audio-Visual-Text pre-training model to connect the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder to handle textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings, and extracts the audiovisual features most informative for the corresponding text. Additionally, to leverage the correspondences of audio and vision with language respectively, we also establish audio-text and visual-text bi-modal alignments upon the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: contrastive loss, matching loss and language modeling loss. Extensive experiments show that CoAVT can learn strong multimodal correlations and generalize to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps for both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
https://arxiv.org/abs/2401.12264
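The query encoder can be pictured as a set of learnable query embeddings that cross-attend to the joint audio-visual features to pull out the content most relevant to the paired text. The single-layer sketch below, including its dimensions and head count, is an assumption; CoAVT's actual query encoder may be deeper and conditioned differently.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Learnable queries that extract text-relevant audio-visual features.

    Illustrative single-layer version of a query encoder; the real module
    (depth, heads, text conditioning) may differ.
    """
    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audiovisual_feats: torch.Tensor) -> torch.Tensor:
        # audiovisual_feats: (batch, seq_len, dim) from the joint audio-visual encoder
        batch = audiovisual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.cross_attn(q, audiovisual_feats, audiovisual_feats)
        return out  # (batch, num_queries, dim), later matched against text embeddings

query_feats = QueryEncoder()(torch.randn(4, 196, 768))
```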
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level general video information; (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling the video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at this https URL
https://arxiv.org/abs/2401.10588
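The shared latent space can be read as a small set of latent vectors projected by two heads into text-side and frame-side prompts, so both branches are driven by the same parameters. The sketch below only shows that shared-latent idea; prompt counts, dimensions, and the omitted global-local attention are assumptions.

```python
import torch
import torch.nn as nn

class SharedPromptGenerator(nn.Module):
    """Generate text prompts and frame prompts from one shared latent space.

    Sketch only: DGL's actual prompt generation and global-local attention are
    more involved; this just illustrates tying both prompt sets to shared latents.
    """
    def __init__(self, latent_dim=512, num_prompts=4, text_dim=512, visual_dim=768):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(num_prompts, latent_dim) * 0.02)
        self.to_text = nn.Linear(latent_dim, text_dim)
        self.to_frame = nn.Linear(latent_dim, visual_dim)

    def forward(self):
        text_prompts = self.to_text(self.latent)    # prepended to text tokens
        frame_prompts = self.to_frame(self.latent)  # prepended to frame patch tokens
        return text_prompts, frame_prompts

text_p, frame_p = SharedPromptGenerator()()
```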
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
https://arxiv.org/abs/2401.06129
Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared to conventional textual retrieval, the main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. Previous works primarily focus on aligning the query and the video by finely aggregating word-frame matching signals. Inspired by the human cognitive process of modularly judging the relevance between text and video, we observe that such judgment requires high-order matching signals due to the consecutive and complex nature of video content. In this paper, we propose chunk-level text-video matching, where query chunks are extracted to describe a specific retrieval unit, and video chunks are segmented into distinct clips of the video. We formulate chunk-level matching as n-ary correlation modeling between the words of the query and the frames of the video and introduce a multi-modal hypergraph for n-ary correlation modeling. By representing textual units and video frames as nodes and using hyperedges to depict their relationships, a multi-modal hypergraph is constructed. In this way, the query and the video can be aligned in a high-order semantic space. In addition, to enhance the model's generalization ability, the extracted features are fed into a variational inference component to obtain a variational representation under a Gaussian distribution. The incorporation of hypergraphs and variational inference allows our model to capture complex, n-ary interactions among textual and visual contents. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the text-video retrieval task.
https://arxiv.org/abs/2401.03177
We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's using video-and-text conditioned queries. Furthermore, we devise a language-based pipeline that exploits how-to video narration text to create weakly supervised training data. We demonstrate our idea applied to the domain of how-to cooking videos, where a user can detour from their current recipe to find steps with alternate ingredients, tools, and techniques. Validating on a ground truth annotated dataset of 16K samples, we show our model's significant improvements over best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%.
https://arxiv.org/abs/2401.01823
In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, despite the prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage architecture for the retrieval phase. This solution ingeniously balances the coarse and fine granularity of retrieval content, and it also strikes a harmonious equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, it achieves comparable performance with the current state-of-the-art methods while being nearly 50 times faster.
https://arxiv.org/abs/2401.00701
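The retrieval phase reduces to a standard recall-then-rerank pipeline: score every video with cheap coarse-grained embeddings, keep the top-k, and rescore only those candidates with the expensive fine-grained representation. The scoring functions below are placeholders, not the paper's TIB-based scorer.

```python
import numpy as np

def two_stage_retrieval(query_emb, coarse_video_embs, fine_score_fn, k=50):
    """Coarse recall of top-k candidates, then fine-grained reranking.

    query_emb: (dim,) text embedding; coarse_video_embs: (num_videos, dim) array.
    fine_score_fn(query_emb, video_index) -> float stands in for the expensive
    fine-grained (e.g. frame-level, text-gated) scorer.
    """
    coarse_scores = coarse_video_embs @ query_emb              # fast dot-product recall
    candidates = np.argsort(-coarse_scores)[:k]                # top-k by coarse score
    fine_scores = [(int(v), fine_score_fn(query_emb, int(v))) for v in candidates]
    return sorted(fine_scores, key=lambda x: -x[1])            # reranked candidate list

# Usage: ranking = two_stage_retrieval(q, video_matrix, fine_score_fn=my_scorer, k=50)
```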
Self-supervised approaches for video have shown impressive results in video understanding tasks. However, unlike early works that leverage temporal self-supervision, current state-of-the-art methods primarily rely on tasks from the image domain (e.g., contrastive learning) that do not explicitly promote the learning of temporal features. We identify two factors that limit existing temporal self-supervision: 1) tasks are too simple, resulting in saturated training performance, and 2) we uncover shortcuts based on local appearance statistics that hinder the learning of high-level features. To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts. Our model extends a representation of single video frames, pre-trained through contrastive learning, with a transformer that we train through temporal self-supervision. We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision. The generalization capability of our self-supervised video method is evidenced by its state-of-the-art performance in a wide range of high-level semantic tasks, including video retrieval, action classification, and video attribute recognition (such as object and scene identification), as well as low-level temporal correspondence tasks like video object segmentation and pose tracking. Additionally, we show that the video representations learned through our method exhibit increased robustness to the input perturbations.
https://arxiv.org/abs/2312.13008
A short video clip may contain the progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate the events to understand the story behind them. In this work, we present a new multi-shot video understanding benchmark, Shot2Story20K, with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show the challenges of generating a long and comprehensive video summary. Nevertheless, the generated imperfect summaries can already significantly boost the performance of existing video understanding tasks such as video question-answering, promoting an under-explored setting of video understanding with detailed summaries.
https://arxiv.org/abs/2312.10300
Visual retrieval aims to search for the most relevant visual items, e.g., images and videos, from a candidate gallery with a given query item. Accuracy and efficiency are two competing objectives in retrieval tasks. Instead of crafting a new method pursuing further improvement on accuracy, in this paper we propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval. Furthermore, we discover that the similarities obtained by different retrieval models are diversified and incommensurable, which makes it challenging to jointly distill knowledge from multiple models. Therefore, we propose to whiten the output of teacher models before fusion, which enables effective multi-teacher distillation for retrieval models. Whiten-MTD is conceptually simple and practically effective. Extensive experiments on two landmark image retrieval datasets and one video retrieval dataset demonstrate the effectiveness of our proposed method, and its good balance of retrieval performance and efficiency. Our source code is released at this https URL.
https://arxiv.org/abs/2312.09716
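The core trick is to standardize each teacher's similarity scores before fusing them, so that scores from different retrieval models become commensurable. Below is a per-teacher z-score whitening followed by averaging; the paper's exact whitening transform and fusion may differ.

```python
import numpy as np

def whiten_and_fuse(teacher_scores):
    """Whiten each teacher's similarity scores, then fuse by averaging.

    teacher_scores: list of (num_candidates,) arrays, one per teacher model.
    Whitening (zero mean, unit variance here) makes the otherwise
    incommensurable score distributions comparable before fusion.
    """
    whitened = []
    for scores in teacher_scores:
        scores = np.asarray(scores, dtype=np.float64)
        whitened.append((scores - scores.mean()) / (scores.std() + 1e-8))
    return np.mean(whitened, axis=0)  # fused distillation target for the student

fused = whiten_and_fuse([np.random.rand(100), 10 * np.random.rand(100) - 3])
```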
Text-video retrieval, a prominent sub-field within the broader domain of multimedia content management, has witnessed remarkable growth and innovation over the past decade. However, existing methods assume the video scenes are consistent and the description annotators are unbiased. These assumptions fail to align with fluid real-world scenarios, where descriptions can be influenced by annotator biases, diverse writing styles, and varying textual perspectives. To overcome the aforementioned problems, we introduce WAVER, a cross-domain knowledge distillation mechanism designed to tackle the challenge of handling variations in writing style. WAVER capitalizes on the open-vocabulary properties inherent in pre-trained vision-language models and employs an implicit knowledge distillation approach to transfer text-based knowledge from a teacher model to a vision-based student. Empirical studies conducted across four standard benchmark datasets, encompassing various settings, provide compelling evidence that WAVER can achieve state-of-the-art performance in text-video retrieval tasks while handling writing-style variations.
https://arxiv.org/abs/2312.09507
A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks. However, recent works have shown that the current models do not achieve a comprehensive understanding of the textual data during the training for the target downstream tasks. Orthogonal to the previous approaches to this limitation, we postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models. Hence, we utilize the knowledge of a pre-trained large language model (LLM) to generate text samples from the original ones, targeting specific sentence components. We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks. Through rigorous quantitative analysis, our proposed method exhibits significant improvement across several video-language tasks. In particular, our approach notably enhances video-text retrieval by a relative improvement of 8.3% in video-to-text and 1.4% in text-to-video retrieval over the baselines, in terms of R@1. Additionally, in video moment retrieval, average mAP shows a relative improvement ranging from 2.0% to 13.7% across different baselines.
https://arxiv.org/abs/2312.06699
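One hedged way to picture the importance estimation: for each sentence component, compare the video-text matching score of the original caption against an LLM-generated variant with that component altered, and treat the score drop as a weak importance signal. This proxy, and the placeholder score_fn, are assumptions for illustration rather than the paper's estimation module.

```python
def component_importance(score_fn, video, original_caption, component_variants):
    """Estimate the relative importance of sentence components for a video.

    component_variants: {component_name: caption rewritten by an LLM with that
    component altered or removed}. The importance proxy used here (drop in the
    video-text matching score) is an illustrative assumption.
    """
    base = score_fn(video, original_caption)
    drops = {name: max(0.0, base - score_fn(video, variant))
             for name, variant in component_variants.items()}
    total = sum(drops.values()) or 1.0
    return {name: d / total for name, d in drops.items()}  # normalized importance weights
```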
In this paper, we propose an efficient and high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one moment relevant to the input text query. In terms of both efficiency and performance, the overlooked bottleneck of previous studies is the visual encoding of dense frames. This guides researchers to choose lightweight visual backbones, yielding sub-optimal retrieval performance due to the limited capability of their learned visual representations. However, it is undesirable to simply replace them with high-performance large-scale vision-and-language models (VLMs) due to their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that with a simple query-image attention trick, VLMs generalize well to super images and efficiently demonstrate promising zero-shot performance against SOTA methods. In addition, we propose a fine-tuning approach that incorporates a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.
https://arxiv.org/abs/2312.00414
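Constructing a super image is just tiling the N x N sampled frames into one grid so that a single VLM image encoding covers N^2 frames. The sketch below assumes equal-sized frames given as a numpy array.

```python
import numpy as np

def make_super_image(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile n*n frames into a single super image in row-major grid order.

    frames: (n*n, H, W, C) uniformly sampled frames. Returns (n*H, n*W, C).
    One VLM forward pass on the result replaces n*n per-frame encodings.
    """
    assert frames.shape[0] == n * n, "expects exactly n*n frames"
    rows = [np.concatenate(list(frames[i * n:(i + 1) * n]), axis=1)  # one grid row
            for i in range(n)]
    return np.concatenate(rows, axis=0)

grid = make_super_image(np.zeros((16, 224, 224, 3), dtype=np.uint8), n=4)
print(grid.shape)  # (896, 896, 3)
```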
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions using a few long video datasets, showing that they struggle with the transformed data, especially the shortest captions. We also propose a lightweight fine-tuning method, where we use a contrastive loss to learn a hierarchical embedding based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as on the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at this https URL.
https://arxiv.org/abs/2312.00115
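A minimal reading of the fine-tuning objective: an InfoNCE contrastive loss applied per caption granularity (short summary, medium description, full paragraph), pulling the video embedding toward all of its valid descriptions. The flat averaging over levels below is a simplification of the paper's hierarchical embedding objective.

```python
import torch
import torch.nn.functional as F

def multi_granularity_contrastive_loss(video_emb, caption_embs_by_level, tau=0.05):
    """InfoNCE summed over caption granularities (short, medium, paragraph).

    video_emb: (batch, dim); caption_embs_by_level: list of (batch, dim) caption
    embeddings, one per description length, aligned index-wise with the videos.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    loss = 0.0
    for cap_emb in caption_embs_by_level:
        cap_emb = F.normalize(cap_emb, dim=-1)
        logits = video_emb @ cap_emb.t() / tau                 # (batch, batch) similarities
        targets = torch.arange(video_emb.size(0), device=video_emb.device)
        loss = loss + 0.5 * (F.cross_entropy(logits, targets) +
                             F.cross_entropy(logits.t(), targets))
    return loss / len(caption_embs_by_level)
```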
Learning from videos is an emerging research area that enables robots to acquire skills from human demonstrations, such as procedural videos. To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize these understandings to novel domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) intra-video retrieval over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to make use of: (1) out-of-domain visual information; (2) a high temporal context window; and (3) multimodal (text + video) domains. This departs from existing benchmarks for procedural video understanding, which typically deal with short context lengths and can be solved with a single modality. Spacewalk-18, with its inherent multimodal and long-form complexity, exposes the high difficulty of task recognition and segmentation. We find that state-of-the-art methods perform poorly on our benchmark, demonstrating that the goal of generalizable procedural video understanding models remains far off and underscoring the need to develop new approaches to these tasks. Data, model, and code will be publicly released.
https://arxiv.org/abs/2311.18773
To build scalable models for challenging real-world tasks, it is important to learn from diverse, multi-modal data in various forms (e.g., videos, text, and images). Among existing works, a plethora have focused on leveraging large but cumbersome cross-modal architectures. Despite their effectiveness, larger architectures unavoidably prevent the models from being extended to real-world applications, so building a lightweight VL architecture and an efficient learning schema is of great practical value. In this paper, we propose an Efficient Video-Language Model (dubbed E-ViLM) and a masked video modeling (MVM) schema, assisted by a semantic vector-quantized tokenizer. In particular, our E-ViLM learns to reconstruct the semantic labels of masked video regions, produced by the pre-trained vector-quantized tokenizer, which discretizes continuous visual signals into labels. We show that with our simple MVM task and regular VL pre-training modelings, our E-ViLM, despite its compactness, is able to learn expressive representations from a video-language corpus and generalize well to extensive video-language tasks including video question answering, text-to-video retrieval, etc. In particular, our E-ViLM obtains obvious efficiency improvements, reaching competitive performance with faster inference speed: our model reaches 39.3% Top-1 accuracy on the MSRVTT benchmark, retaining 91.4% of the accuracy of the state-of-the-art larger VL architecture with only 15% of the parameters and 94.8% fewer GFLOPs. We also provide extensive ablative studies that validate the effectiveness of our proposed learning schema for E-ViLM.
https://arxiv.org/abs/2311.17267
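The MVM schema can be sketched as follows: a frozen vector-quantized tokenizer assigns a discrete semantic label to every video region, a fraction of regions is masked at the input, and the model is trained with cross-entropy to predict the labels of the masked regions. The interfaces, shapes, and mask ratio below are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_video_modeling_loss(model, tokenizer, video_patches, mask_ratio=0.4):
    """Cross-entropy on masked regions whose targets come from a VQ tokenizer.

    video_patches: (batch, num_regions, dim). `tokenizer(video_patches)` is
    assumed to return discrete labels of shape (batch, num_regions); `model`
    is assumed to return per-region logits over the tokenizer vocabulary.
    """
    with torch.no_grad():
        labels = tokenizer(video_patches)                # (batch, num_regions)
    mask = torch.rand(labels.shape, device=labels.device) < mask_ratio
    logits = model(video_patches, mask)                  # (batch, num_regions, vocab)
    return F.cross_entropy(logits[mask], labels[mask])   # loss only on masked regions
```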
Large pre-trained vision models achieve impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively computationally expensive. Recent studies turn their focus towards efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods lack attention to training memory usage and to the exploration of transferring a larger model to the video domain. In this paper, we present a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding, named Side4Video. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to reduce memory usage by 75% compared to previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B) to video understanding tasks, a model 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), especially on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at this https URL.
https://arxiv.org/abs/2311.15769
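The memory saving comes from never backpropagating through the frozen image backbone: intermediate features are taken under torch.no_grad and fed to a small trainable side path. The sketch below abstracts the backbone as a list of frozen blocks and omits the temporal modules of the real side network; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SideNetwork(nn.Module):
    """Lightweight trainable side path fed by a frozen image backbone.

    `frozen_blocks` is assumed to be a list of already-frozen transformer blocks
    on the right device; gradients flow only through the small side modules.
    """
    def __init__(self, frozen_blocks, feat_dim=1024, side_dim=128):
        super().__init__()
        self.frozen_blocks = frozen_blocks
        self.down = nn.ModuleList(nn.Linear(feat_dim, side_dim) for _ in frozen_blocks)
        self.side_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(side_dim, nhead=4, batch_first=True)
            for _ in frozen_blocks)

    def forward(self, tokens):
        side = 0.0
        for blk, down, side_blk in zip(self.frozen_blocks, self.down, self.side_blocks):
            with torch.no_grad():                    # no backprop through the heavy backbone
                tokens = blk(tokens)
            side = side_blk(side + down(tokens))     # fuse multi-level spatial features
        return side                                  # (batch, seq, side_dim) video representation
```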