Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames, so video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR), pretrain on UCF101 or Kinetics, and evaluate on standard benchmarks including video retrieval, action recognition, and action detection. Performance improves by a significant margin without the need for large models or extensive datasets.
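The abstract gives the idea but no implementation details, so here is a minimal sketch, assuming the temporal derivatives are approximated by frame differencing and consistency is enforced with a simple cosine objective. The function names `temporal_derivative` and `derivative_consistency_loss`, and the `encoder` interface, are illustrative rather than from the paper; the actual ViDiDi objective plugs the derivative views into VICReg/BYOL/SimCLR with a balanced alternating schedule.

```python
import torch
import torch.nn.functional as F

def temporal_derivative(frames: torch.Tensor, order: int = 1) -> torch.Tensor:
    """Finite-difference approximation of the order-th temporal derivative.

    frames: (T, C, H, W) clip; each differencing step shortens the clip by one frame.
    """
    for _ in range(order):
        frames = frames[1:] - frames[:-1]
    return frames


def derivative_consistency_loss(encoder, clip: torch.Tensor, max_order: int = 2) -> torch.Tensor:
    """Pull the embeddings of a clip and of its temporal derivatives together.

    `encoder` is any video encoder mapping a (1, T, C, H, W) clip to a (1, D)
    embedding and tolerating variable T (e.g., via temporal pooling).
    """
    views = [clip] + [temporal_derivative(clip, k) for k in range(1, max_order + 1)]
    embeddings = [F.normalize(encoder(v.unsqueeze(0)), dim=-1) for v in views]
    loss = sum(
        (1.0 - F.cosine_similarity(embeddings[0], e, dim=-1)).mean()
        for e in embeddings[1:]
    )
    return loss / max_order
```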
https://arxiv.org/abs/2409.02371
Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets. Project Page: this https URL.
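Since alignability is evaluated with cycle-consistency metrics, below is a hedged sketch of one common way to compute such a score from frame features. The function name `cycle_consistency_score` and the one-frame tolerance are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np

def cycle_consistency_score(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fraction of frames in A that return to (near) themselves after a
    nearest-neighbour round trip A -> B -> A in feature space.

    feats_a: (Ta, D), feats_b: (Tb, D); rows are L2-normalised frame features.
    """
    sim = feats_a @ feats_b.T                      # (Ta, Tb) cosine similarities
    a_to_b = sim.argmax(axis=1)                    # best match in B for each frame of A
    b_to_a = sim.argmax(axis=0)                    # best match in A for each frame of B
    round_trip = b_to_a[a_to_b]                    # A -> B -> A
    return float(np.mean(np.abs(round_trip - np.arange(len(feats_a))) <= 1))
```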
https://arxiv.org/abs/2409.01445
Most text-video retrieval methods utilize the text-image pre-trained CLIP as a backbone, incorporating complex modules that result in high computational overhead. As a result, many studies focus on efficient fine-tuning. The primary challenge in efficient adaptation arises from the inherent differences between the image and video modalities. Each sampled video frame must be processed by the image encoder independently, which increases complexity and complicates practical deployment. Although existing efficient methods fine-tune with few trainable parameters, they still incur high inference costs due to the large number of tokens. In this work, we argue that temporal redundancy significantly contributes to the model's high complexity due to repeated information in consecutive frames. Existing token compression methods for image models fail to solve these unique challenges, as they overlook temporal redundancy across frames. To tackle these problems, we propose Temporal Token Merging (TempMe) to reduce temporal redundancy. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we merge temporal tokens across different frames and learn video-level features, leading to lower complexity and better performance. Extensive experiments validate the superiority of TempMe. Compared to previous efficient text-video retrieval methods, TempMe significantly reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8X speedup and a 4.4% R-Sum improvement. Additionally, TempMe exhibits robust generalization by integrating effectively with both efficient and full fine-tuning methods. With full fine-tuning, TempMe achieves a significant 7.9% R-Sum improvement, trains 1.57X faster, and uses 75.2% of the GPU memory. Our code will be released.
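As a rough illustration of temporal token merging, in the spirit of bipartite token merging rather than the exact TempMe progressive multi-granularity procedure, the sketch below averages the most similar token pairs across two neighbouring frames; the `keep_ratio` parameter and the pairing rule are assumptions.

```python
import torch
import torch.nn.functional as F

def merge_temporal_tokens(tokens_a: torch.Tensor, tokens_b: torch.Tensor,
                          keep_ratio: float = 0.5) -> torch.Tensor:
    """Merge the most similar token pairs across two neighbouring frames.

    tokens_a, tokens_b: (N, D) patch tokens of adjacent frames. The most similar
    cross-frame pairs are averaged into single tokens; the rest are kept.
    """
    sim = F.normalize(tokens_a, dim=-1) @ F.normalize(tokens_b, dim=-1).T  # (N, N)
    best_sim, best_idx = sim.max(dim=1)            # best partner in b for each token of a
    n_merge = int(len(tokens_a) * (1.0 - keep_ratio))
    src = best_sim.topk(n_merge).indices           # tokens of a that get merged away
    merged = 0.5 * (tokens_a[src] + tokens_b[best_idx[src]])
    keep_a = torch.ones(len(tokens_a), dtype=torch.bool)
    keep_a[src] = False
    keep_b = torch.ones(len(tokens_b), dtype=torch.bool)
    keep_b[best_idx[src]] = False                  # a b-token may absorb several a-tokens
    return torch.cat([tokens_a[keep_a], tokens_b[keep_b], merged], dim=0)
```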
https://arxiv.org/abs/2409.01156
Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate similarity scores, which are then sorted to obtain retrieval results. This approach considers the matching between each candidate video and the query, but it incurs a significant time cost that grows notably as the number of candidates increases. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, a sequence-to-sequence generative model that directly generates video identifiers and retrieves candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines to achieve better retrieval performance with only 30%-50% of the original retrieval time on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at this https URL.
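Generative retrieval of this kind typically constrains decoding so that only identifiers present in the corpus can be emitted; the toy trie below illustrates that mechanism. The token ids and the `IdentifierTrie` class are hypothetical, and T2VIndexer's actual identifier encoding is learned rather than hand-written.

```python
class IdentifierTrie:
    """Prefix trie over tokenised video identifiers; `allowed_next` can back a
    constrained-decoding hook so a seq2seq indexer only emits valid identifiers."""

    def __init__(self):
        self.children = {}

    def insert(self, token_ids):
        node = self
        for t in token_ids:
            node = node.children.setdefault(t, IdentifierTrie())

    def allowed_next(self, prefix):
        node = self
        for t in prefix:
            if t not in node.children:
                return []
            node = node.children[t]
        return list(node.children.keys())


# Toy corpus of three tokenised identifiers (hypothetical token ids).
trie = IdentifierTrie()
for ident in [[5, 12, 7], [5, 12, 9], [8, 3, 1]]:
    trie.insert(ident)
print(trie.allowed_next([5, 12]))   # -> [7, 9]
```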
https://arxiv.org/abs/2408.11432
Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherent plain structure of CLIP, few TVR methods explore the multi-scale representations which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks have validated the superiority of MUSE.
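Below is a minimal sketch of how multi-scale tokens could be built from the last single-scale feature map before being handed to a sequence learner. The pooling-based pyramid and the scale choices are assumptions; the Mamba module itself is left abstract.

```python
import torch
import torch.nn.functional as F

def multi_scale_tokens(feat_map: torch.Tensor, scales=(1, 2, 4)) -> torch.Tensor:
    """Turn the final (B, D, H, W) feature map into a multi-scale token sequence.

    Each scale s pools the map to roughly (H/s, W/s) and flattens it; the
    concatenated sequence would then be processed by a multi-scale learner
    (Mamba in MUSE) to produce scale-wise representations.
    """
    B, D, H, W = feat_map.shape
    tokens = []
    for s in scales:
        pooled = F.adaptive_avg_pool2d(feat_map, (max(H // s, 1), max(W // s, 1)))
        tokens.append(pooled.flatten(2).transpose(1, 2))   # (B, h*w, D)
    return torch.cat(tokens, dim=1)                        # (B, sum of h*w, D)
```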
https://arxiv.org/abs/2408.10575
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations. Composition understanding becomes particularly challenging for video data since the compositional relations rapidly change over time in videos. We first build a benchmark named AARO to evaluate composition understanding related to actions on top of spatial concepts. The benchmark is constructed by generating negative texts with incorrect action descriptions for a given video and the model is expected to pair a positive text with its corresponding video. Furthermore, we propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding. We also develop a negative-augmented visual-language matching loss which is used explicitly to benefit from the generated negative text. We compare NAVERO with other state-of-the-art methods in terms of compositional understanding as well as video-text retrieval performance. NAVERO achieves significant improvement over other methods for both video-language and image-language composition understanding, while maintaining strong performance on traditional text-video retrieval tasks.
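The negative-augmented matching idea can be illustrated with a simple hinge on similarities, where the generated negative text for the same video must score lower than the positive. This is a hedged sketch, not NAVERO's exact loss (which may use a matching head instead), and the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def negative_augmented_matching_loss(video: torch.Tensor, text_pos: torch.Tensor,
                                     text_neg: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Hinge loss that explicitly exploits generated negative texts.

    video: (B, D) video embeddings; text_pos: (B, D) matching captions;
    text_neg: (B, D) captions with corrupted action descriptions for the same
    videos. All rows are assumed L2-normalised.
    """
    pos_sim = (video * text_pos).sum(dim=-1)   # similarity to the correct caption
    neg_sim = (video * text_neg).sum(dim=-1)   # similarity to the corrupted caption
    return F.relu(margin - pos_sim + neg_sim).mean()
```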
https://arxiv.org/abs/2408.09511
In the rapidly expanding domain of web video content, the task of text-video retrieval has become increasingly critical, bridging the semantic gap between textual queries and video data. This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video, enhancing the effectiveness of text-video retrieval systems. Unlike traditional model-centric methods that focus on designing intricate cross-modal interaction mechanisms, GQE aims to expand the text queries associated with videos both during training and testing phases. By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions, effectively bridging the data imbalance gap. Furthermore, during retrieval, GQE utilizes Large Language Models (LLM) to generate a diverse set of queries and a query selection module to filter these queries based on relevance and diversity, thus optimizing retrieval performance while reducing computational overhead. Our contributions include a detailed examination of the information imbalance challenge, a novel approach to query expansion in video-text datasets, and the introduction of a query selection strategy that enhances retrieval accuracy without increasing computational costs. GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX, demonstrating the effectiveness of addressing text-video retrieval from a data-centric perspective.
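The query selection module filters LLM-generated queries by relevance and diversity; one standard way to do that is a greedy maximal-marginal-relevance pass, sketched below. The MMR formulation and the `lam` and `k` parameters are assumptions rather than the paper's exact criterion.

```python
import numpy as np

def select_queries(query_embs: np.ndarray, original_emb: np.ndarray,
                   k: int = 5, lam: float = 0.7) -> list:
    """Greedily pick k expanded queries balancing relevance and diversity.

    query_embs: (N, D) embeddings of LLM-generated query variants;
    original_emb: (D,) embedding of the original user query; rows L2-normalised.
    """
    relevance = query_embs @ original_emb
    selected = [int(relevance.argmax())]
    while len(selected) < min(k, len(query_embs)):
        redundancy = (query_embs @ query_embs[selected].T).max(axis=1)
        score = lam * relevance - (1.0 - lam) * redundancy
        score[selected] = -np.inf                 # never re-pick an already selected query
        selected.append(int(score.argmax()))
    return selected
```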
https://arxiv.org/abs/2408.07249
Implicit Neural Representations (INRs) have emerged as powerful representations to encode all forms of data, including images, videos, audio, and scenes. For video, many INRs have been proposed for the compression task, and recent methods feature significant improvements in encoding time, storage, and reconstruction quality. However, these encoded representations lack semantic meaning, so they cannot be used for downstream tasks that require such properties, such as retrieval. This can act as a barrier to the adoption of video INRs over traditional codecs, as they do not offer any significant edge apart from compression. To alleviate this, we propose a flexible framework that decouples the spatial and temporal aspects of the video INR. We accomplish this with a dictionary of per-frame latents that are learned jointly with a set of video-specific hypernetworks, such that, given a latent, these hypernetworks can predict the INR weights to reconstruct the given frame. This framework not only retains the compression efficiency, but the learned latents can be aligned with features from large vision models, which grants them discriminative properties. We align these latents with CLIP and show good performance on both compression and video retrieval tasks. By aligning with VideoLlama, we are able to perform open-ended chat with our learned latents as the visual inputs. Additionally, the learned latents serve as a proxy for the underlying weights, allowing us to perform tasks like video interpolation. These semantic properties and applications, coexisting with the ability to perform compression, interpolation, and super-resolution, are a first in this field of work.
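A toy version of the per-frame latent plus hypernetwork decoupling is sketched below: a small network maps a frame latent to the weights of a 2-layer coordinate MLP that renders that frame. The layer sizes and architecture are illustrative assumptions; in the paper the latents are additionally aligned with CLIP features to gain semantics.

```python
import math
import torch
import torch.nn as nn

class FrameHypernetwork(nn.Module):
    """Maps a per-frame latent to the weights of a tiny coordinate MLP (the INR)
    that renders that frame; a toy stand-in for the paper's hypernetworks."""

    def __init__(self, latent_dim: int = 256, hidden: int = 64,
                 coord_dim: int = 2, out_dim: int = 3):
        super().__init__()
        # Weight and bias shapes of a 2-layer coordinate MLP.
        self.shapes = [(hidden, coord_dim), (hidden,), (out_dim, hidden), (out_dim,)]
        n_params = sum(math.prod(s) for s in self.shapes)
        self.net = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_params))

    def forward(self, latent: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        """latent: (latent_dim,); coords: (N, coord_dim) pixel coordinates in [-1, 1]."""
        flat = self.net(latent)
        chunks, i = [], 0
        for s in self.shapes:
            n = math.prod(s)
            chunks.append(flat[i:i + n].view(*s))
            i += n
        w1, b1, w2, b2 = chunks
        h = torch.relu(coords @ w1.T + b1)     # hidden features per coordinate
        return h @ w2.T + b2                   # predicted RGB per coordinate
```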
https://arxiv.org/abs/2408.02672
Feedback is essential for learning a new skill or improving one's current skill-level. However, current methods for skill-assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching -- expert commentary, expert video retrieval, and the first-of-its-kind expert pose generation -- outperforming strong vision-language models on both established metrics and human preference studies.
https://arxiv.org/abs/2408.00672
With the rapid development of the short video industry, traditional e-commerce has encountered a new paradigm, video-driven e-commerce, which leverages attractive videos for product showcases and provides both video and item services for users. Benefiting from the dynamic and visualized introduction of items, video-driven e-commerce has shown huge potential in stimulating consumer confidence and promoting sales. In this paper, we focus on the video retrieval task, facing the following challenges: (1) How to handle the heterogeneities among users, items, and videos? (2) How to mine the complementarity between items and videos for better user understanding? We first leverage a dual graph to model the co-existence of user-video and user-item interactions in video-driven e-commerce and innovatively reduce user preference understanding to a graph matching problem. To solve it, we further propose a novel bi-level Graph Matching Network (GMN), which mainly consists of node- and preference-level graph matching. Given a user, node-level graph matching aims to match videos and items, while preference-level graph matching aims to match multiple user preferences extracted from both videos and items. The proposed GMN can then generate and improve user embeddings by aggregating matched nodes or preferences from the dual graph in a bi-level manner. Comprehensive experiments show the superiority of the proposed GMN, with significant improvements over state-of-the-art approaches (e.g., AUC +1.9% and CTR +7.15%). We have deployed it on a well-known video-driven e-commerce platform, serving hundreds of millions of users every day.
https://arxiv.org/abs/2408.00346
Text-based person re-identification (Re-ID) is a challenging topic in the field of complex multimodal analysis; its ultimate aim is to recognize specific pedestrians by scrutinizing attributes/natural language descriptions. Despite the wide range of applicable areas such as security surveillance, video retrieval, person tracking, and social media analytics, there is a notable absence of comprehensive reviews dedicated to summarizing text-based person Re-ID from a technical perspective. To address this gap, we introduce a taxonomy spanning Evaluation, Strategy, Architecture, and Optimization dimensions, providing a comprehensive survey of the text-based person Re-ID task. We start by laying the groundwork for text-based person Re-ID, elucidating fundamental concepts related to attribute/natural language-based identification. Then a thorough examination of existing benchmark datasets and metrics is presented. Subsequently, we delve into prevalent feature extraction strategies employed in text-based person Re-ID research, followed by a concise summary of common network architectures within the domain. Prevalent loss functions utilized for model optimization and modality alignment in text-based person Re-ID are also scrutinized. To conclude, we offer a concise summary of our findings, pinpointing challenges in text-based person Re-ID. In response to these challenges, we outline potential avenues for future open-set text-based person Re-ID and present a baseline architecture for text-based pedestrian image generation-guided re-identification (TBPGR).
https://arxiv.org/abs/2408.00096
In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR. Our code and benchmark are freely available at this https URL.
https://arxiv.org/abs/2407.16658
Different from traditional video retrieval, sign language retrieval is more biased towards understanding the semantic information of human actions contained in video clips. Previous works typically only encode RGB videos to obtain high-level semantic features, resulting in local action details being drowned in a large amount of redundant visual information. Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training, and adopt an offline RGB encoder instead, leading to suboptimal feature representations. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates the Pose and RGB modalities to represent the local and global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. For better context-aware fusion of the two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module to aggregate adjacent clip features with similar semantic information, both intra-modality and inter-modality. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion feature by contextual matching of fine-grained dual-stream features. Apart from the offline RGB encoder, the whole framework contains only learnable lightweight networks and can be trained end-to-end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on various datasets.
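A hedged sketch of dual-stream pose-RGB fusion via cross-attention follows; the real CGAF module additionally restricts aggregation to adjacent clips with similar semantics, and the layer sizes here are assumptions.

```python
import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    """Minimal cross-attention fusion of pose and RGB clip features.

    Each pose clip token attends over the RGB clip tokens and vice versa, and
    the two streams are summed with their inputs; a stand-in for the spirit of
    cross-modal aggregation, not the exact CGAF design.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.pose_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rgb_to_pose = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pose: torch.Tensor, rgb: torch.Tensor) -> torch.Tensor:
        # pose, rgb: (B, T, dim) clip-level features from the two encoders
        p, _ = self.pose_to_rgb(pose, rgb, rgb)    # pose queries attend to RGB
        r, _ = self.rgb_to_pose(rgb, pose, pose)   # RGB queries attend to pose
        return p + r + pose + rgb                  # fused dual-stream representation
```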
https://arxiv.org/abs/2407.16394
The rapid growth of online video resources has significantly promoted the development of video retrieval methods. As a standard evaluation metric for video retrieval, Average Precision (AP) assesses the overall ranking of relevant videos at the top of the list, making the predicted scores a reliable reference for users. However, recent video retrieval methods utilize pair-wise losses that treat all sample pairs equally, leading to an evident gap between the training objective and the evaluation metric. To effectively bridge this gap, in this work, we aim to address two primary challenges: a) the current similarity measure and AP-based loss are suboptimal for video retrieval; b) the noticeable noise from frame-to-frame matching introduces ambiguity in estimating the AP loss. In response to these challenges, we propose the Hierarchical learning framework for Average-Precision-oriented Video Retrieval (HAP-VR). For the former challenge, we develop the TopK-Chamfer Similarity and QuadLinear-AP loss to measure and optimize video-level similarities in terms of AP. For the latter challenge, we suggest constraining the frame-level similarities to achieve an accurate AP loss estimation. Experimental results show that HAP-VR outperforms existing methods on several benchmark datasets, providing a feasible solution for video retrieval tasks and thus offering potential benefits for multimedia applications.
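One natural reading of the TopK-Chamfer Similarity is sketched below: aggregate a frame-to-frame similarity matrix into a video-level score by averaging each query frame's top-k matches instead of its single best match. The exact definition in HAP-VR, and its QuadLinear-AP surrogate, may differ.

```python
import torch

def topk_chamfer_similarity(frame_sim: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Video-level similarity from frame-level similarities.

    frame_sim: (Tq, Tr) cosine similarities between the frames of a query video
    and the frames of a reference video. Averaging each row's top-k entries is
    more robust to noisy frame matches than plain max-based Chamfer matching.
    """
    k = min(k, frame_sim.shape[1])
    return frame_sim.topk(k, dim=1).values.mean()
```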
https://arxiv.org/abs/2407.15566
The rapid expansion of multimedia content has made accurately retrieving relevant videos from large collections increasingly challenging. Recent advancements in text-video retrieval have focused on cross-modal interactions, large-scale foundation model training, and probabilistic modeling, yet often neglect the crucial user perspective, leading to discrepancies between user queries and the content retrieved. To address this, we introduce MERLIN (Multimodal Embedding Refinement via LLM-based Iterative Navigation), a novel, training-free pipeline that leverages Large Language Models (LLMs) for iterative feedback learning. MERLIN refines query embeddings from a user perspective, enhancing alignment between queries and video content through a dynamic question answering process. Experimental results on datasets like MSR-VTT, MSVD, and ActivityNet demonstrate that MERLIN substantially improves Recall@1, outperforming existing systems and confirming the benefits of integrating LLMs into multimodal retrieval systems for more responsive and context-aware multimedia retrieval.
https://arxiv.org/abs/2407.12508
Understanding the content of events occurring in a video and their inherent temporal logic is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack sufficient event information, and the widely adopted video-level cross-modal contrastive learning also struggles to capture detailed and complex video-text event alignment. To address these challenges, we make improvements from both data and model perspectives. In terms of pre-training data, we focus on supplementing the missing specific event content and event temporal transitions with the proposed event augmentation strategies. Based on the event-augmented data, we construct a novel Event-Aware Video-Text Retrieval model, i.e., EA-VTR, which achieves powerful video-text retrieval ability through superior video event awareness. EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment, ultimately enhancing the comprehensive understanding of video events. Our method not only significantly outperforms existing approaches on multiple datasets for the Text-to-Video Retrieval and Video Action Recognition tasks, but also demonstrates superior event content perception on the Multi-event Video-Text Retrieval and Video Moment Retrieval tasks, as well as outstanding event temporal logic understanding on the Test of Time task.
https://arxiv.org/abs/2407.07478
Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering downstream performance on unpopular subjects. To address these problems, we propose a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, we propose a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights, enabling dynamic adjustment of the model's focus throughout training. With training guided by a small amount of unbiased meta-data and augmented by video-text data generated by a large vision-language model, we improve video-language representations and achieve superior performance on commonly used video question answering and text-video retrieval datasets.
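The subtractive angular margin can be read as scoring positive pairs with cos(theta - m) inside a standard InfoNCE objective, so the loss is satisfied before the pair reaches perfect similarity. The sketch below follows that reading, with `margin` and `tau` as assumed hyperparameters; the MLP-parameterized sample weighting is omitted.

```python
import torch
import torch.nn.functional as F

def subtractive_margin_contrastive_loss(v: torch.Tensor, t: torch.Tensor,
                                        margin: float = 0.1, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE with a subtractive angular margin on the positive pairs.

    v, t: (B, D) L2-normalised video / text embeddings with matched rows.
    Positives are scored as cos(theta - margin), which relaxes the push toward
    perfect similarity on imperfectly aligned video-text pairs.
    """
    sim = v @ t.T                                            # (B, B) cosine similarities
    theta = torch.acos(sim.clamp(-1 + 1e-6, 1 - 1e-6))
    relaxed = torch.cos(theta - margin)
    eye = torch.eye(len(v), dtype=torch.bool, device=v.device)
    logits = torch.where(eye, relaxed, sim) / tau            # margin only on the diagonal
    targets = torch.arange(len(v), device=v.device)
    return F.cross_entropy(logits, targets)
```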
https://arxiv.org/abs/2407.03788
We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet -- a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide spatial localization, and predict the atomic actions of the referred person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion, which amplify the most relevant information across these streams and consistently surpass standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at this https URL.
https://arxiv.org/abs/2407.01872
The key to the text-to-video retrieval (TVR) task lies in learning the unique similarity between each pair of text (consisting of words) and video (consisting of audio and image frames) representations. However, some problems exist in the alignment of video and text representations; for example, a text, and further each of its words, is of different importance for different video frames. Besides, audio usually carries additional or critical information for TVR when the frames carry little valid information. Therefore, in the TVR task, multi-granularity representations of text, including the whole sentence and every word, and the audio modality are beneficial but underutilized in most existing works. To address this, we propose a novel multi-granularity feature interaction module called MGFI, consisting of text-frame and word-frame interactions, for video-text representation alignment. Moreover, we introduce a cross-modal feature interaction module between audio and text, called CMFI, to solve the problem of insufficient expression of the frames in a video. Experiments on benchmark datasets such as MSR-VTT, MSVD, and DiDeMo show that the proposed method outperforms existing state-of-the-art methods.
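A minimal sketch of fine-grained word-frame interaction: each word picks its best-matching frame and the per-word scores are averaged, so words of different importance contribute through their own matches. This max-mean aggregation is a common choice, not necessarily the exact MGFI design, and the audio-text CMFI module is not covered here.

```python
import torch

def word_frame_similarity(word_feats: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
    """Aggregate word-to-frame similarities into a clip-level retrieval score.

    word_feats: (Nw, D) token embeddings of the query text;
    frame_feats: (Nf, D) frame embeddings of the video; rows L2-normalised.
    """
    sim = word_feats @ frame_feats.T          # (Nw, Nf) word-frame similarities
    return sim.max(dim=1).values.mean()       # best frame per word, averaged over words
```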
https://arxiv.org/abs/2407.12798
A more robust and holistic language-video representation is the key to pushing video understanding forward. Despite improvements in training strategies, the quality of the language-video dataset receives less attention. The current plain and simple text descriptions and the visual-only focus of language-video tasks result in limited capability for real-world natural language video retrieval, where queries are much more complex. This paper introduces a method to automatically enhance video-language datasets, making them more modality- and context-aware to meet more sophisticated representation learning needs, hence helping all downstream tasks. Our multifaceted video captioning method captures entities, actions, speech transcripts, aesthetics, and emotional cues, providing detailed and correlated information from the text side to the video side for training. We also develop an agent-like strategy using language models to generate high-quality, factual textual descriptions, reducing human intervention and enabling scalability. The method's effectiveness in improving language-video representation is evaluated through text-video retrieval on the MSR-VTT dataset with several multi-modal retrieval models.
https://arxiv.org/abs/2406.13809