The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
https://arxiv.org/abs/2601.16155
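The abstract above describes FFSM's key-frame selection only at a high level. As a hedged illustration of the general idea (not the paper's actual module), frames most similar to the text query can be kept while preserving temporal order; the `select_key_frames` helper and the toy one-hot embeddings below are assumptions for demonstration:

```python
import numpy as np

def select_key_frames(frame_feats, text_feat, k=3):
    """Keep the k frames most cosine-similar to the text query,
    returned in their original temporal order. A stand-in for a
    learned selection module, not the paper's FFSM."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = f @ t                         # per-frame relevance scores
    keep = np.sort(np.argsort(-sims)[:k])  # top-k, re-sorted by time
    return keep

# Toy setup: 12 orthogonal frame embeddings; the query's content
# overlaps frames 4 and 9, so those should survive selection.
frames = np.eye(12)
query = frames[4] + frames[9]
keep = select_key_frames(frames, query, k=2)
print(keep)  # → [4 9]
```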
Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.
https://arxiv.org/abs/2601.13797
Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they lack the ability to process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search but their retrieval performance remains well below that of specialized systems. We present VIRTUE, an MLLM-based versatile video retrieval framework that integrates corpus and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval, and state-of-the-art results for zero-shot composed video retrieval. With additional training for reranking candidates identified in the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state-of-the-art specialized models which are trained on orders of magnitude larger data.
https://arxiv.org/abs/2601.12193
The safety validation of autonomous robotic vehicles hinges on systematically testing their planning and control stacks against rare, safety-critical scenarios. Mining these long-tail events from massive real-world driving logs is therefore a critical step in the robotic development lifecycle. The goal of the Scenario Mining task is to retrieve useful information to enable targeted re-simulation, regression testing, and failure analysis of the robot's decision-making algorithms. RefAV, introduced by the Argoverse team, is an end-to-end framework that uses large language models (LLMs) to spatially and temporally localize scenarios described in natural language. However, this process performs retrieval on trajectory labels, ignoring the direct connection between natural language and raw RGB images, which runs counter to the intuition of video retrieval; it also depends on the quality of upstream 3D object detection and tracking. Further, inaccuracies in trajectory data lead to inaccuracies in downstream spatial and temporal localization. To address these issues, we propose Robust Scenario Mining for Robotic Autonomy from Coarse to Fine (SMc2f), a coarse-to-fine pipeline that 1) employs vision-language models (VLMs) for coarse image-text filtering, 2) builds a database of successful mining cases on top of RefAV and automatically retrieves exemplars to few-shot condition the LLM for more robust retrieval, and 3) introduces text-trajectory contrastive learning to pull matched pairs together and push mismatched pairs apart in a shared embedding space, yielding a fine-grained matcher that refines the LLM's candidate trajectories. Experiments on public datasets demonstrate substantial gains in both retrieval quality and efficiency.
https://arxiv.org/abs/2601.12010
Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in the single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: this https URL
https://arxiv.org/abs/2601.05239
Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
https://arxiv.org/abs/2601.04824
Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: \emph{(i)} a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and \emph{(ii)} an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at this https URL.
https://arxiv.org/abs/2512.23483
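The entropy-weighted key-frame sampler in the TV-RAG abstract above can be approximated with a simple sketch: split the video into evenly spaced windows and keep the highest-entropy frame from each, which yields frames that are evenly spaced yet information-dense. The histogram-entropy scoring and the `sample_keyframes` helper below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def entropy(frame, bins=16):
    # Shannon entropy of a grayscale frame's intensity histogram.
    hist, _ = np.histogram(frame, bins=bins, range=(0, 1))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def sample_keyframes(frames, k):
    """Split the video into k evenly spaced windows and keep the
    highest-entropy frame from each, preserving temporal order."""
    n = len(frames)
    bounds = np.linspace(0, n, k + 1, dtype=int)
    picks = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        scores = [entropy(f) for f in frames[lo:hi]]
        picks.append(lo + int(np.argmax(scores)))
    return picks

# Toy video: 8 flat (zero-entropy) frames with two textured ones.
rng = np.random.default_rng(0)
frames = [np.full((4, 4), 0.5) for _ in range(8)]
frames[2] = rng.random((4, 4))
frames[6] = rng.random((4, 4))
picks = sample_keyframes(frames, 2)
print(picks)  # → [2, 6]
```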
Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) to a time-aware video-text embedding model without using any video data at all. For evaluating time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware as shown in the NegBench benchmark that evaluates negation in video retrieval, (ii) TARA achieves state-of-the-art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state-of-the-art zero-shot performance.
https://arxiv.org/abs/2512.13511
The rapid expansion of video content across online platforms has accelerated the need for retrieval systems capable of understanding not only isolated visual moments but also the temporal structure of complex events. Existing approaches often fall short in modeling temporal dependencies across multiple events and in handling queries that reference unseen or rare visual concepts. To address these challenges, we introduce MADTempo, a video retrieval framework developed by our team, AIO_Trinh, that unifies temporal search with web-scale visual grounding. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments, enabling coherent retrieval of multi-event queries. Complementarily, a Google Image Search-based fallback module expands query representations with external web imagery, effectively bridging gaps in pretrained visual embeddings and improving robustness against out-of-distribution (OOD) queries. Together, these components advance the temporal reasoning and generalization capabilities of modern video retrieval systems, paving the way for more semantically aware and adaptive retrieval across large-scale video corpora.
https://arxiv.org/abs/2512.12929
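The MADTempo abstract above mentions aggregating similarity scores across sequential segments for multi-event queries without giving the exact rule. One minimal reading, assumed here purely for illustration, is to score a video by the best temporally ordered assignment of event sub-queries to segments, so that events appearing in the wrong order are penalized:

```python
from itertools import combinations

def multi_event_score(event_scores):
    """event_scores[e][t]: similarity of event sub-query e to segment t.
    Score a video as the best assignment of events to segments in which
    event e occurs strictly before event e+1 (temporal order respected)."""
    n_events, n_segs = len(event_scores), len(event_scores[0])
    best = float("-inf")
    for idxs in combinations(range(n_segs), n_events):  # increasing indices
        total = sum(event_scores[e][t] for e, t in enumerate(idxs))
        best = max(best, total)
    return best

# Query: "a goal is scored, then the crowd celebrates", over 4 segments.
video_a = [[0.9, 0.2, 0.1, 0.1],   # goal in segment 0
           [0.1, 0.1, 0.8, 0.2]]   # celebration in segment 2: correct order
video_b = [[0.1, 0.1, 0.9, 0.1],   # goal late...
           [0.8, 0.1, 0.1, 0.1]]   # ...celebration early: wrong order
score_a = multi_event_score(video_a)
score_b = multi_event_score(video_b)
print(round(score_a, 2), round(score_b, 2))  # video_a ranks higher
```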
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to that of classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite having only 1.6B parameters.
https://arxiv.org/abs/2512.10942
Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on this https URL.
https://arxiv.org/abs/2512.02792
In recent years, significant developments have been made in both video retrieval and video moment retrieval tasks, which respectively retrieve complete videos or moments for a given text query. These advancements have greatly improved user satisfaction during the search process. However, previous work has failed to establish meaningful "interaction" between the retrieval system and the user, and its one-way retrieval paradigm can no longer fully meet the personalization and dynamic needs of at least 80.8\% of users. In this paper, we introduce the Interactive Video Corpus Retrieval (IVCR) task, a more realistic setting that enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. To facilitate research on this challenging task, we introduce IVCR-200K, a high-quality, bilingual, multi-turn, conversational, and abstract semantic dataset that supports video retrieval and even moment retrieval. Furthermore, we propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions. The extensive experiments demonstrate the effectiveness of our dataset and framework.
https://arxiv.org/abs/2512.01312
The goal of text-to-video retrieval is to search large databases for relevant videos based on text queries. Existing methods have progressed to handling explicit queries where the visual content of interest is described explicitly; however, they fail with implicit queries where identifying videos relevant to the query requires reasoning. We introduce reasoning text-to-video retrieval, a paradigm that extends traditional retrieval to process implicit queries through reasoning while providing object-level grounding masks that identify which entities satisfy the query conditions. Instead of relying on vision-language models directly, we propose representing video content as digital twins, i.e., structured scene representations that decompose salient objects through specialist vision models. This approach is beneficial because it enables large language models to reason directly over long-horizon video content without visual token compression. Specifically, our two-stage framework first performs compositional alignment between decomposed sub-queries and digital twin representations for candidate identification, then applies large language model-based reasoning with just-in-time refinement that invokes additional specialist models to address information gaps. We construct a benchmark of 447 manually created implicit queries with 135 videos (ReasonT2VBench-135) and another more challenging version of 1000 videos (ReasonT2VBench-1000). Our method achieves 81.2% R@1 on ReasonT2VBench-135, outperforming the strongest baseline by greater than 50 percentage points, and maintains 81.7% R@1 on the extended configuration while establishing state-of-the-art results in three conventional benchmarks (MSR-VTT, MSVD, and VATEX).
https://arxiv.org/abs/2511.12371
The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.
https://arxiv.org/abs/2511.12255
In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific and free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach, combining the advantages of the SOTA tracking model FastTracker and Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively, which won second place in the challenge.
https://arxiv.org/abs/2511.03332
We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications.
https://arxiv.org/abs/2512.16925
In the retrieval domain, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: this https URL
https://arxiv.org/abs/2511.01617
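For context on the ViC abstract above, the CombSUM baseline it outperforms is a classic training-free score-fusion rule: normalize each retriever's scores to a common range, then sum them per candidate. A minimal sketch (the retriever names and scores below are made-up toy data):

```python
def combsum(runs):
    """runs: list of {doc_id: score} dicts from heterogeneous retrievers.
    Min-max normalize each run, then sum normalized scores per document
    and return doc_ids ranked by the fused score."""
    fused = {}
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0  # guard against a constant-score run
        for doc, s in run.items():
            fused[doc] = fused.get(doc, 0.0) + (s - lo) / span
    return sorted(fused, key=fused.get, reverse=True)

# Two retrievers on incomparable score scales (cosine vs. raw logits):
run_clip = {"v1": 0.91, "v2": 0.88, "v3": 0.10}
run_asr  = {"v2": 12.0, "v3": 11.5, "v1": 2.0}
ranking = combsum([run_clip, run_asr])
print(ranking)  # → ['v2', 'v1', 'v3']
```

Because each run is normalized before summing, a retriever with large raw scores cannot dominate the fusion; a candidate ranked well by both retrievers ("v2") wins.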
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
https://arxiv.org/abs/2510.27571
Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
https://arxiv.org/abs/2510.27432
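The order-preserving token merging mentioned in the PRVR abstract above is not specified in detail; a hedged sketch of the general idea, assumed here, is to merge only temporally adjacent, cosine-similar tokens so that segment order is never permuted and each merged segment stays internally coherent:

```python
import numpy as np

def merge_adjacent(tokens, threshold=0.9):
    """Order-preserving merge: fold a token into the current group only
    when it is cosine-similar to the group mean; otherwise start a new
    group. Groups (segments) keep their temporal order."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    groups = [[tokens[0]]]
    for t in tokens[1:]:
        mean = np.mean(groups[-1], axis=0)
        if cos(t, mean) >= threshold:
            groups[-1].append(t)
        else:
            groups.append([t])
    return [np.mean(g, axis=0) for g in groups]

# Six frame tokens: three along direction x, then three along direction y,
# i.e. two distinct events in sequence.
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
tokens = [x, x * 0.9, x * 1.1, y, y * 1.2, y * 0.8]
merged = merge_adjacent(tokens)
print(len(merged))  # → 2 internally coherent, mutually distinct segments
```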
We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as MLLMs in multi-video understanding and reranking. Our code and benchmark are available at this https URL.
https://arxiv.org/abs/2510.21406