Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at this https URL
https://arxiv.org/abs/2505.13928
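A rough sketch of the generate-score-refine captioning loop described above. `vlm_caption`, `score_caption`, `refine_caption`, and `summarize` are hypothetical callables standing in for the VLM captioner, the quality scorer, the refinement step, and the semantic-fusion summarizer; the threshold and round budget are illustrative, not the paper's settings.

```python
# Hedged sketch of a generate-score-refine captioning pipeline; all callables
# passed in are hypothetical stand-ins, not the LoVR implementation.

def caption_clip(clip, vlm_caption, score_caption, refine_caption,
                 quality_threshold=0.8, max_rounds=3):
    """Generate a caption, then iteratively refine it until the quality
    score clears a threshold or the round budget is exhausted."""
    caption = vlm_caption(clip)
    for _ in range(max_rounds):
        score = score_caption(clip, caption)
        if score >= quality_threshold:
            break
        caption = refine_caption(clip, caption, score)
    return caption

def fuse_clip_captions(clip_captions, summarize):
    """Semantic fusion: merge per-clip captions into one full-video caption.
    `summarize` is a hypothetical LLM/VLM summarization call."""
    return summarize(" ".join(clip_captions))
```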
Modern video understanding systems excel at tasks such as scene classification, object detection, and short video retrieval. However, as video analysis becomes increasingly central to real-world applications, there is a growing need for proactive video agents: systems that not only interpret video streams but also reason about events and take informed actions. A key obstacle in this direction is temporal reasoning: while deep learning models have made remarkable progress in recognizing patterns within individual frames or short clips, they struggle to understand the sequencing and dependencies of events over time, which is critical for action-driven decision-making. Addressing this limitation demands moving beyond conventional deep learning approaches. We posit that tackling this challenge requires a neuro-symbolic perspective, where video queries are decomposed into atomic events, structured into coherent sequences, and validated against temporal constraints. Such an approach can enhance interpretability, enable structured reasoning, and provide stronger guarantees on system behavior, all key properties for advancing trustworthy video agents. To this end, we present a grand challenge to the research community: developing the next generation of intelligent video agents that integrate three core capabilities: (1) autonomous video search and analysis, (2) seamless real-world interaction, and (3) advanced content generation. By addressing these pillars, we can transition from passive perception to intelligent video agents that reason, predict, and act, pushing the boundaries of video understanding.
https://arxiv.org/abs/2505.13851
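A toy illustration of the neuro-symbolic direction argued for above: a query is decomposed into atomic events, and a detected event sequence is checked against "A before B" temporal constraints. Event names and timestamps are invented for the example.

```python
# Minimal sketch: validate a detected event sequence against temporal
# constraints derived from a decomposed query. All data here is made up.

def satisfies(constraints, detections):
    """detections: dict mapping event name -> detection time (seconds).
    constraints: list of (earlier_event, later_event) pairs."""
    return all(
        a in detections and b in detections and detections[a] < detections[b]
        for a, b in constraints
    )

constraints = [("car_stops", "pedestrian_crosses"), ("pedestrian_crosses", "car_moves")]
detections = {"car_stops": 3.2, "pedestrian_crosses": 5.0, "car_moves": 9.1}
print(satisfies(constraints, detections))  # True
```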
Recent advances in text-video retrieval have been largely driven by contrastive learning frameworks. However, existing methods overlook a key source of optimization tension: the separation between text and video distributions in the representation space (referred to as the modality gap), and the prevalence of false negatives in batch sampling. These factors lead to conflicting gradients under the InfoNCE loss, impeding stable alignment. To mitigate this, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment Delta_ij between text t_i and video v_j to offload the tension from the global anchor representation. We first derive the ideal form of Delta_ij via a coupled multivariate first-order Taylor approximation of the InfoNCE loss under a trust-region constraint, revealing it as a mechanism for resolving gradient conflicts by guiding updates along a locally optimal descent direction. Due to the high cost of directly computing Delta_ij, we introduce a lightweight neural module conditioned on the semantic gap between each video-text pair, enabling structure-aware correction guided by gradient supervision. To further stabilize learning and promote interpretability, we regularize Delta using three components: a trust-region constraint to prevent oscillation, a directional diversity term to promote semantic coverage, and an information bottleneck to limit redundancy. Experiments across four retrieval benchmarks show that GARE consistently improves alignment accuracy and robustness to noisy supervision, confirming the effectiveness of gap-aware tension mitigation.
https://arxiv.org/abs/2505.12499
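A minimal PyTorch sketch of the gap-aware idea, assuming a small MLP predicts a pair-specific increment from the text-video gap and InfoNCE is applied to the corrected similarities with a trust-region-style penalty. Network sizes and the exact regularizers are assumptions, not the GARE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: a learnable, pair-specific increment Delta_ij is predicted from the
# text-video gap and added on the video side before InfoNCE scoring.

class GapIncrement(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, t, v):                     # t: (B, D), v: (B, D)
        gap = t.unsqueeze(1) - v.unsqueeze(0)    # (B, B, D) pairwise gaps
        return self.net(gap)                     # Delta_ij for every pair

def gare_style_loss(t, v, delta_net, tau=0.05, trust_radius=0.1):
    t, v = F.normalize(t, dim=-1), F.normalize(v, dim=-1)
    delta = delta_net(t, v)
    # Trust-region-style penalty keeps the increments small.
    reg = F.relu(delta.norm(dim=-1) - trust_radius).mean()
    sims = torch.einsum("id,ijd->ij", t, F.normalize(v.unsqueeze(0) + delta, dim=-1))
    labels = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(sims / tau, labels) + reg

if __name__ == "__main__":
    B, D = 8, 32
    loss = gare_style_loss(torch.randn(B, D), torch.randn(B, D), GapIncrement(D))
    print(loss.item())
```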
Images used in real-world applications such as image or video retrieval, outdoor surveillance, and autonomous driving suffer from poor weather conditions. When designing robust computer vision systems, removing adverse weather effects such as haze, rain, and snow is a significant problem. Recently, deep-learning methods have offered solutions for single types of degradation, but current state-of-the-art universal methods struggle with combinations of degradations, such as haze and rain streaks. Few algorithms have been developed that perform well on images containing multiple adverse weather conditions. This work focuses on developing an efficient solution for multiple adverse weather removal using a unified quaternion neural architecture called CMAWRNet. It is based on a novel texture-structure decomposition block, a novel lightweight encoder-decoder quaternion transformer architecture, and an attentive fusion block with low-light correction. We also introduce a quaternion similarity loss function to better preserve color information. Quantitative and qualitative evaluation on current state-of-the-art benchmark datasets and real-world images shows the performance advantages of the proposed CMAWRNet over other state-of-the-art weather removal approaches that handle multiple weather artifacts. Extensive computer simulations validate that CMAWRNet improves the performance of downstream applications such as object detection. This is the first time the decomposition approach has been applied to the universal weather removal task.
https://arxiv.org/abs/2505.01882
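One plausible reading of a quaternion similarity loss for color preservation, sketched in PyTorch: each RGB pixel is treated as a pure quaternion (0, r, g, b) and the loss penalizes the angular difference between predicted and ground-truth color quaternions. This is an assumption-laden illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hedged sketch: RGB pixels as pure quaternions, loss = 1 - cosine similarity.

def quaternion_similarity_loss(pred, target, eps=1e-6):
    """pred, target: (B, 3, H, W) RGB images in [0, 1]."""
    b = pred.size(0)
    zeros = torch.zeros_like(pred[:, :1])
    q_pred = torch.cat([zeros, pred], dim=1).reshape(b, 4, -1)   # (B, 4, HW)
    q_tgt = torch.cat([zeros, target], dim=1).reshape(b, 4, -1)
    cos = F.cosine_similarity(q_pred, q_tgt, dim=1, eps=eps)     # per-pixel
    return (1.0 - cos).mean()

if __name__ == "__main__":
    x, y = torch.rand(2, 3, 8, 8), torch.rand(2, 3, 8, 8)
    print(quaternion_similarity_loss(x, y).item())
```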
AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.
https://arxiv.org/abs/2505.00254
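A minimal sketch of indexing events as a knowledge graph and querying it, using networkx. The node/edge schema and the `entities` field are illustrative; AVA's EKG construction and agentic retrieval loop are considerably richer than this.

```python
import networkx as nx

# Sketch: a toy event knowledge graph with temporal "next" edges, queried by entity.

def build_ekg(events):
    """events: list of dicts like {"id", "t_start", "t_end", "label", "entities"}."""
    g = nx.DiGraph()
    for ev in events:
        g.add_node(ev["id"], **ev)
    ordered = sorted(events, key=lambda e: e["t_start"])
    for prev, nxt in zip(ordered, ordered[1:]):          # temporal adjacency edges
        g.add_edge(prev["id"], nxt["id"], relation="next")
    return g

def retrieve(g, entity):
    """Return events mentioning an entity, in temporal order."""
    hits = [n for n, d in g.nodes(data=True) if entity in d.get("entities", [])]
    return sorted(hits, key=lambda n: g.nodes[n]["t_start"])

events = [
    {"id": "e1", "t_start": 0, "t_end": 40, "label": "car enters lot", "entities": ["car"]},
    {"id": "e2", "t_start": 45, "t_end": 90, "label": "person exits car", "entities": ["car", "person"]},
]
print(retrieve(build_ekg(events), "person"))  # ['e2']
```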
Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant video moment features and treating them as hard negative samples, thereby encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances feature discrimination by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments on three datasets demonstrate the superiority of our approach compared to previous methods, achieving state-of-the-art results.
https://arxiv.org/abs/2504.19637
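A small sketch of the inter-sample correlation idea: text queries and video moments from different samples that score above a similarity threshold are collected as pseudo-positive pairs. The plain cosine similarity and the threshold are assumptions used only to make the mechanism concrete.

```python
import torch
import torch.nn.functional as F

# Sketch: mine semantically similar yet unpaired (query, moment) combinations
# as pseudo-positives for semantic space construction.

def mine_pseudo_positives(text_emb, moment_emb, threshold=0.8):
    """text_emb: (B, D), moment_emb: (B, M, D). Returns (i, j, m) triples
    marking query i as a pseudo-positive for moment m of video j (i != j)."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(moment_emb, dim=-1)
    sims = torch.einsum("id,jmd->ijm", t, v)
    pairs = []
    for i, j, m in (sims > threshold).nonzero(as_tuple=False).tolist():
        if i != j:                      # unpaired by annotation, similar by content
            pairs.append((i, j, m))
    return pairs

if __name__ == "__main__":
    print(mine_pseudo_positives(torch.randn(4, 16), torch.randn(4, 5, 16), 0.3))
```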
The rapid growth of video-text data presents challenges in storage and computation during training. Online learning, which processes streaming data in real-time, offers a promising solution to these issues while also allowing swift adaptations in scenarios demanding real-time responsiveness. One strategy to enhance the efficiency and effectiveness of learning involves identifying and prioritizing data that enhances performance on target downstream tasks. We propose the Relevance and Specificity-based online filtering framework (ReSpec), which selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target-focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. Relevance is determined by the probabilistic alignment of incoming data with downstream tasks, while specificity employs the distance to a root embedding representing the least specific data as an efficient proxy for informativeness. By establishing reference points from target task data, ReSpec filters incoming data in real-time, eliminating the need for extensive storage and compute. Evaluated on the large-scale datasets WebVid2M and VideoCC3M, ReSpec attains state-of-the-art performance on five zero-shot video retrieval tasks, using as little as 5% of the data while incurring minimal compute. The source code is available at this https URL.
https://arxiv.org/abs/2504.14875
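A minimal sketch of the two ReSpec signals, assuming relevance is the maximum cosine similarity to reference embeddings built from target-task data and specificity is the distance to a "root" embedding; the thresholds and the mean-based root are illustrative stand-ins, not the paper's exact estimators.

```python
import torch
import torch.nn.functional as F

# Sketch: keep an incoming sample only if it is both relevant to the target
# task and sufficiently specific (far from the "least specific" root).

def respec_filter(sample_emb, task_refs, root_emb, rel_thresh=0.3, spec_thresh=0.5):
    """sample_emb: (D,), task_refs: (K, D), root_emb: (D,). Returns keep/drop."""
    s = F.normalize(sample_emb, dim=-1)
    refs = F.normalize(task_refs, dim=-1)
    relevance = (refs @ s).max().item()           # closest target-task reference
    specificity = torch.norm(sample_emb - root_emb).item()
    return relevance >= rel_thresh and specificity >= spec_thresh

if __name__ == "__main__":
    refs = torch.randn(10, 64)
    root = refs.mean(dim=0)                       # assumed proxy for the root embedding
    print(respec_filter(torch.randn(64), refs, root))
```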
In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
https://arxiv.org/abs/2504.13035
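A sketch of encoding a variable-length video into a fixed number of prototypes with learned queries and cross-attention, plus an orthogonality penalty that pushes prototypes toward distinct content. Module sizes and the loss form are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: learned prototype queries attend over frame features; an orthogonality
# loss discourages prototypes from collapsing onto the same content.

class PrototypeEncoder(nn.Module):
    def __init__(self, dim, num_prototypes=8, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats):                          # (B, T, D)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        protos, _ = self.attn(q, frame_feats, frame_feats)   # (B, P, D)
        return protos

def orthogonality_loss(protos):
    p = F.normalize(protos, dim=-1)
    gram = p @ p.transpose(1, 2)                             # (B, P, P)
    eye = torch.eye(p.size(1), device=p.device)
    return ((gram - eye) ** 2).mean()

if __name__ == "__main__":
    enc = PrototypeEncoder(64)
    protos = enc(torch.randn(2, 30, 64))
    print(protos.shape, orthogonality_loss(protos).item())
```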
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of both effective and efficient solutions to capture the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments of distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) than the recent GMMFormer method while scoring 6.0 points higher (SumR) on TVR.
https://arxiv.org/abs/2504.10920
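Hedged sketches of the two auxiliary objectives named above, assuming `moment_feats` are pooled features of the discovered moments and `query_feat` is the sentence embedding; the span-anchor parameterization and masked multi-moment attention are not reproduced here.

```python
import torch
import torch.nn.functional as F

# Sketch: diversity loss pushes moments apart, relevance loss pulls the
# best-matching moment toward its query.

def moment_diversity_loss(moment_feats):
    """Encourage different moments to cover distinct content. (B, M, D)."""
    m = F.normalize(moment_feats, dim=-1)
    sim = m @ m.transpose(1, 2)                                  # (B, M, M)
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=1, dim2=2))
    return off_diag.clamp(min=0).mean()

def moment_relevance_loss(moment_feats, query_feat):
    """Promote query-relevant moments. (B, M, D), (B, D)."""
    m = F.normalize(moment_feats, dim=-1)
    q = F.normalize(query_feat, dim=-1).unsqueeze(-1)            # (B, D, 1)
    best = (m @ q).squeeze(-1).max(dim=1).values                 # (B,)
    return (1.0 - best).mean()

if __name__ == "__main__":
    mf, qf = torch.randn(2, 4, 32), torch.randn(2, 32)
    print(moment_diversity_loss(mf).item(), moment_relevance_loss(mf, qf).item())
```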
The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this paper, we address these limitations through a novel Interactive Video Corpus Moment Retrieval framework that integrates a SuperGlobal Reranking mechanism and Adaptive Bidirectional Temporal Search (ABTS), strategically optimizing query similarity, temporal stability, and computational resources. By preprocessing a large corpus of videos using a keyframe extraction model and deduplication technique through image hashing, our approach provides a scalable solution that significantly reduces storage requirements while maintaining high localization precision across diverse video repositories.
https://arxiv.org/abs/2504.09298
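A small sketch of keyframe deduplication with a simple average hash: near-duplicate frames map to (almost) the same 64-bit hash and are dropped before indexing. The paper's keyframe extractor and exact hashing scheme are not reproduced; this only illustrates the deduplication step.

```python
import numpy as np
from PIL import Image

# Sketch: average-hash keyframes and drop any frame within a small Hamming
# distance of an already-kept frame.

def average_hash(img, size=8):
    g = np.asarray(img.convert("L").resize((size, size)), dtype=np.float32)
    return (g > g.mean()).flatten()            # 64-bit boolean hash

def deduplicate(frame_paths, max_hamming=5):
    kept, hashes = [], []
    for path in frame_paths:
        h = average_hash(Image.open(path))
        if all(np.count_nonzero(h != prev) > max_hamming for prev in hashes):
            kept.append(path)
            hashes.append(h)
    return kept

# Usage (hypothetical file names): deduplicate(["kf_000.jpg", "kf_001.jpg"])
# keeps one frame per visually distinct shot, shrinking the searchable index.
```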
Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive video retrieval through four key innovations: (1) an ensemble search strategy that integrates coarse-grained (CLIP) and fine-grained (BEIT3) models to improve retrieval accuracy, (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication, (3) a temporal search mechanism that localizes video segments using dual queries for start and end points, and (4) a temporal reranking approach that leverages neighboring frame context to stabilize rankings. Evaluated on known-item search and question-answering tasks, our framework demonstrates substantial improvements in retrieval precision, efficiency, and user interpretability, offering a robust solution for real-world interactive video retrieval applications.
https://arxiv.org/abs/2504.08384
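A sketch of two of the four ideas, under assumed per-frame similarity scores: fusing a coarse-grained and a fine-grained model by weighted averaging, and reranking each frame with its temporal neighbors so isolated spurious matches are damped. The fusion weight and window size are illustrative choices.

```python
import numpy as np

# Sketch: score-level ensemble of two retrieval models plus neighbor-context
# temporal reranking.

def ensemble_scores(clip_scores, beit3_scores, alpha=0.5):
    """Both arrays: (num_frames,) similarity of each frame to the query."""
    return alpha * clip_scores + (1.0 - alpha) * beit3_scores

def temporal_rerank(scores, window=2):
    """Smooth each frame's score with its +/- `window` neighbors."""
    padded = np.pad(scores, window, mode="edge")
    kernel = np.ones(2 * window + 1) / (2 * window + 1)
    return np.convolve(padded, kernel, mode="valid")

if __name__ == "__main__":
    s1, s2 = np.random.rand(20), np.random.rand(20)
    fused = ensemble_scores(s1, s2)
    print(np.argsort(-temporal_rerank(fused))[:5])   # top-5 frames after reranking
```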
Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.
https://arxiv.org/abs/2504.04572
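A minimal sketch of a subtitles-based segmentation rule, assuming segments are split wherever the silence between consecutive subtitle lines exceeds a gap threshold; the threshold is an illustrative choice, not the paper's.

```python
# Sketch: merge consecutive subtitle lines into one retrievable segment until
# the gap between them exceeds `max_gap` seconds.

def segment_by_subtitles(subtitles, max_gap=5.0):
    """subtitles: list of (start, end, text) sorted by start time."""
    segments, current = [], [subtitles[0]]
    for prev, nxt in zip(subtitles, subtitles[1:]):
        if nxt[0] - prev[1] > max_gap:
            segments.append(current)
            current = []
        current.append(nxt)
    segments.append(current)
    return [(seg[0][0], seg[-1][1], " ".join(s[2] for s in seg)) for seg in segments]

subs = [(0.0, 2.5, "Add the flour."), (3.0, 5.0, "Mix well."), (40.0, 43.0, "Bake it.")]
print(segment_by_subtitles(subs))
# [(0.0, 5.0, 'Add the flour. Mix well.'), (40.0, 43.0, 'Bake it.')]
```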
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags -- automatically extracted from foundation models -- to enhance video retrieval. We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts derived from the features of a video and its corresponding caption. We introduce these auxiliary concepts to improve the alignment of visual and textual latent concepts, and so are able to distinguish concepts from one another. We conduct extensive experiments on five diverse datasets: MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The experimental results consistently demonstrate that modality-specific tags improve cross-modal alignment, outperforming current state-of-the-art methods across three datasets and performing comparably or better across the other two.
https://arxiv.org/abs/2504.01591
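A loose sketch of the auxiliary-concept idea, assuming small projection heads map video and caption features onto a shared set of latent concepts and a KL alignment loss pulls the two activations of a matching pair together; the heads, concept count, and loss are assumptions, not the MAC-VR modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: project each modality onto shared latent concepts and align the
# resulting concept distributions of a matching video-caption pair.

class ConceptHeads(nn.Module):
    def __init__(self, dim, num_concepts=32):
        super().__init__()
        self.video_head = nn.Linear(dim, num_concepts)
        self.text_head = nn.Linear(dim, num_concepts)

    def forward(self, video_feat, text_feat):             # (B, D), (B, D)
        cv = self.video_head(video_feat).softmax(dim=-1)   # concept activations
        ct = self.text_head(text_feat).softmax(dim=-1)
        return cv, ct

def concept_alignment_loss(cv, ct):
    return F.kl_div(cv.clamp_min(1e-8).log(), ct, reduction="batchmean")

if __name__ == "__main__":
    heads = ConceptHeads(64)
    cv, ct = heads(torch.randn(4, 64), torch.randn(4, 64))
    print(concept_alignment_loss(cv, ct).item())
```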
In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon three main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations yield performance gains on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
https://arxiv.org/abs/2503.19009
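A minimal sketch of ColBERT-style late interaction for text-video scoring: each query token takes its maximum similarity over all visual tokens and the per-token maxima are summed. The query/visual expansions and the dual sigmoid training loss from the paper are omitted here.

```python
import torch
import torch.nn.functional as F

# Sketch: MaxSim late interaction between query tokens and spatio-temporal
# visual tokens, as in ColBERT-style retrieval.

def late_interaction_score(query_tokens, visual_tokens):
    """query_tokens: (Lq, D), visual_tokens: (Lv, D) -> scalar score."""
    q = F.normalize(query_tokens, dim=-1)
    v = F.normalize(visual_tokens, dim=-1)
    return (q @ v.T).max(dim=1).values.sum()

if __name__ == "__main__":
    q = torch.randn(12, 128)                              # 12 query tokens
    videos = [torch.randn(200, 128) for _ in range(3)]    # per-video visual tokens
    scores = torch.stack([late_interaction_score(q, v) for v in videos])
    print(scores.argmax().item())                         # best-matching video
```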
The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures. By leveraging VLM embeddings for initial retrieval and modeling contextual relationships among video segments, our approach enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.
https://arxiv.org/abs/2503.17415
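A small sketch of combining vector search with a segment graph: the query first retrieves the nearest segments by cosine similarity, then the hit set is expanded along graph edges (here plain temporal adjacency) to surface neighboring context for query refinement. Edge semantics and hop count are illustrative assumptions.

```python
import numpy as np
import networkx as nx

# Sketch: cosine-similarity retrieval followed by graph-based expansion of hits.

def vector_search(query, seg_embs, k=3):
    sims = seg_embs @ query / (np.linalg.norm(seg_embs, axis=1) * np.linalg.norm(query))
    return list(np.argsort(-sims)[:k])

def expand_with_graph(hits, graph, hops=1):
    frontier, result = set(hits), set(hits)
    for _ in range(hops):
        frontier = {n for h in frontier for n in graph.neighbors(h)} - result
        result |= frontier
    return sorted(result)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seg_embs = rng.normal(size=(10, 32))
    g = nx.path_graph(10)                     # segments linked by temporal adjacency
    hits = vector_search(rng.normal(size=32), seg_embs)
    print(hits, expand_with_graph(hits, g))
```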
Long-form video understanding is essential for various applications such as video retrieval, summarization, and question answering. Yet, traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present the Long-Video Memory Network (Long-VMNet), a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet needs only one scan through the video, greatly boosting efficiency. Our results on the Rest-ADL dataset demonstrate an 18x to 75x improvement in inference times for long-form video retrieval and question answering, with competitive predictive performance.
https://arxiv.org/abs/2503.13707
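A sketch of a fixed-size memory filled by a learned sampler: a scoring head rates every incoming patch token and only the top-K are kept, so memory stays constant regardless of video length. The scorer and memory size are illustrative stand-ins for Long-VMNet's components.

```python
import torch
import torch.nn as nn

# Sketch: a learned scorer keeps only the top-K most discriminative tokens,
# giving a fixed-size memory for arbitrarily long videos.

class FixedMemory(nn.Module):
    def __init__(self, dim, memory_size=256):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)       # stand-in for the neural sampler
        self.memory_size = memory_size

    def forward(self, patch_tokens):          # (N, D), N can be huge
        scores = self.scorer(patch_tokens).squeeze(-1)
        k = min(self.memory_size, patch_tokens.size(0))
        idx = torch.topk(scores, k).indices
        return patch_tokens[idx]              # (K, D) fixed-size summary

if __name__ == "__main__":
    mem = FixedMemory(dim=64)
    summary = mem(torch.randn(10_000, 64))    # one pass over the whole video
    print(summary.shape)                      # torch.Size([256, 64])
```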
Text-to-Video Retrieval (TVR) aims to match videos with corresponding textual queries, yet the continual influx of new video content poses a significant challenge for maintaining system performance over time. In this work, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to overcome these limitations. Our analysis reveals that current TVR methods based on pre-trained models struggle to retain plasticity when adapting to new tasks, while existing continual learning approaches experience catastrophic forgetting, resulting in semantic misalignment between historical queries and stored video features. To address these challenges, we propose StableFusion, a novel CTVR framework comprising two main components: the Frame Fusion Adapter (FFA), which captures temporal dynamics in video content while preserving model flexibility, and the Task-Aware Mixture-of-Experts (TAME), which maintains consistent semantic alignment between queries across tasks and the stored video features. Comprehensive evaluations on two benchmark datasets under various task settings demonstrate that StableFusion outperforms existing continual learning and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks in the context of continuous video streams. Our code is available at: this https URL
https://arxiv.org/abs/2503.10111
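A hedged sketch of task-aware expert routing in the spirit of TAME: each task id selects a soft mixture over a small set of expert adapters, so features for earlier tasks keep flowing through (mostly) the experts they were learned with. Expert count and gating form are assumptions, not the StableFusion module itself.

```python
import torch
import torch.nn as nn

# Sketch: per-task gating over a small bank of linear expert adapters.

class TaskAwareMoE(nn.Module):
    def __init__(self, dim, num_experts=4, num_tasks=5):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.task_gates = nn.Embedding(num_tasks, num_experts)

    def forward(self, feats, task_id):                        # feats: (B, D)
        gates = self.task_gates(task_id).softmax(dim=-1)       # (B, E)
        expert_out = torch.stack([e(feats) for e in self.experts], dim=1)  # (B, E, D)
        return (gates.unsqueeze(-1) * expert_out).sum(dim=1)

if __name__ == "__main__":
    moe = TaskAwareMoE(dim=64)
    out = moe(torch.randn(8, 64), torch.full((8,), 2, dtype=torch.long))
    print(out.shape)
```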
Integrating audio and visual data for training multimodal foundational models remains challenging. We present Audio-Video Vector Alignment (AVVA), which aligns audiovisual (AV) scene content beyond mere temporal synchronization via a Large Language Model (LLM)-based data curation pipeline. Specifically, AVVA scores and selects high-quality training clips using Whisper (speech-based audio foundation model) for audio and DINOv2 for video within a dual-encoder contrastive learning framework. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate that this approach can achieve significant accuracy gains with substantially less curated data. For instance, AVVA yields a 7.6% improvement in top-1 accuracy for audio-to-video retrieval on VGGSound compared to ImageBind, despite training on only 192 hours of carefully filtered data (vs. 5800+ hours). Moreover, an ablation study highlights that trading data quantity for data quality improves performance, yielding respective top-3 accuracy increases of 47.8, 48.4, and 58.0 percentage points on AudioCaps, VALOR, and VGGSound over uncurated baselines. While these results underscore AVVA's data efficiency, we also discuss the overhead of LLM-driven curation and how it may be scaled or approximated in larger domains. Overall, AVVA provides a viable path toward more robust, text-free audiovisual learning with improved retrieval accuracy.
https://arxiv.org/abs/2503.09205
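A sketch of the two stages described above, assuming a hypothetical `llm_score` stands in for the LLM-based audio-visual agreement score computed from Whisper and DINOv2 outputs: clips below a threshold are dropped, and a dual encoder is trained with a symmetric InfoNCE loss on the survivors.

```python
import torch
import torch.nn.functional as F

# Sketch: threshold-based curation followed by dual-encoder contrastive training.

def curate(clips, llm_score, threshold=0.7):
    # `llm_score` is a hypothetical callable; AVVA derives its score via an LLM.
    return [c for c in clips if llm_score(c) >= threshold]

def symmetric_infonce(audio_emb, video_emb, tau=0.07):
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.T / tau
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

if __name__ == "__main__":
    print(symmetric_infonce(torch.randn(8, 64), torch.randn(8, 64)).item())
```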
In recent text-video retrieval, the use of additional captions from vision-language models has shown promising effects on the performance. However, existing models using additional captions often have struggled to capture the rich semantics, including temporal changes, inherent in the video. In addition, incorrect information caused by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) dual-modal matching score by adding query-video similarity and query-narration similarity, and 4) hard-negative loss to learn discriminative features from multiple perspectives using the two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets.
https://arxiv.org/abs/2503.05186
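A minimal sketch of the dual-modal matching score: the ranking score for a (query, video) pair is a weighted sum of query-video and query-narration similarities, where the narration embedding pools frame-level captions. The weighting and pooling are illustrative; the query-aware filtering and hard-negative loss are omitted.

```python
import torch
import torch.nn.functional as F

# Sketch: fuse query-video and query-narration similarities into one ranking score.

def dual_modal_score(query_emb, video_emb, narration_emb, w=0.5):
    q = F.normalize(query_emb, dim=-1)          # (B, D)
    v = F.normalize(video_emb, dim=-1)          # (N, D)
    n = F.normalize(narration_emb, dim=-1)      # (N, D) pooled frame-level captions
    return w * (q @ v.T) + (1 - w) * (q @ n.T)  # (B, N) fused ranking scores

if __name__ == "__main__":
    scores = dual_modal_score(torch.randn(2, 64), torch.randn(5, 64), torch.randn(5, 64))
    print(scores.argmax(dim=1))                 # best video per query
```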
Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap in similarity distribution between positive and negative pairs, making it challenging to distinguish hard negative pairs effectively. To deal with this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
https://arxiv.org/abs/2503.04812
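A hedged sketch of hardness-aware contrastive training: negatives that are more similar to the query receive larger weights in the InfoNCE denominator, approximating the idea of improving representation learning for negative pairs based on their discriminative difficulty. The exponential weighting is an assumption, not LLaVE's exact scheme.

```python
import torch
import torch.nn.functional as F

# Sketch: up-weight hard negatives in the InfoNCE denominator.

def hardness_weighted_infonce(q_emb, t_emb, tau=0.05, beta=2.0):
    q = F.normalize(q_emb, dim=-1)
    t = F.normalize(t_emb, dim=-1)
    sims = q @ t.T / tau                                   # (B, B)
    labels = torch.arange(q.size(0), device=q.device)
    weights = torch.exp(beta * sims.detach())              # harder negative -> larger weight
    weights.fill_diagonal_(1.0)                            # positives stay unweighted
    logits = sims + weights.log()                          # weights the negatives' exp(sims)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    print(hardness_weighted_infonce(torch.randn(8, 32), torch.randn(8, 32)).item())
```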