Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on average 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).
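The abstract does not spell out the objective, so below is a minimal sketch of how a joint generative + contrastive objective of this kind is commonly combined: next-token cross-entropy for the generative term plus symmetric in-batch InfoNCE for the contrastive term. The function names and the weighting `alpha` are illustrative assumptions, not ViLL-E's actual recipe.

```python
import torch
import torch.nn.functional as F

def joint_loss(lm_logits, target_ids, video_emb, text_emb,
               temperature=0.07, alpha=1.0):
    """Illustrative joint objective: caption generation + InfoNCE alignment.

    lm_logits:  (B, T, V) language-model logits over the caption tokens
    target_ids: (B, T)    shifted caption token ids
    video_emb:  (B, D)    pooled video embeddings
    text_emb:   (B, D)    pooled text embeddings
    """
    # Generative term: standard next-token cross-entropy.
    gen = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                          target_ids.reshape(-1))

    # Contrastive term: symmetric InfoNCE over in-batch negatives.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    con = 0.5 * (F.cross_entropy(logits, labels) +
                 F.cross_entropy(logits.T, labels))

    return gen + alpha * con
```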
https://arxiv.org/abs/2604.12148
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at this https URL.
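To make the decoupling concrete, here is a hedged sketch of one dual-path step: a random high-ratio mask for the reconstruction branch and a separate, teacher-scored visibility pattern for the contrastive branch. The module signatures (`student`, `decoder`, `teacher`) and the norm-based saliency heuristic are assumptions, not TG-DP's implementation.

```python
import torch
import torch.nn.functional as F

def dual_path_step(video_tokens, audio_tokens, student, decoder, teacher,
                   recon_ratio=0.75, align_ratio=0.25, temperature=0.07):
    """Illustrative dual-path step with decoupled masking regimes.

    video_tokens: (B, N, D) patch tokens; audio_tokens: (B, M, D).
    Assumed modules: `student` maps visible tokens to features, `decoder`
    maps them back to the full token grid, `teacher` scores token saliency."""
    B, N, D = video_tokens.shape

    # Path 1: reconstruction with a random high-ratio mask (MAE-style).
    keep = int(N * (1 - recon_ratio))
    idx = torch.rand(B, N, device=video_tokens.device).argsort(dim=1)[:, :keep]
    vis = torch.gather(video_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    loss_recon = F.mse_loss(decoder(student(vis)), video_tokens)

    # Path 2: alignment with a teacher-guided visibility pattern:
    # keep the most salient tokens instead of a random subset.
    with torch.no_grad():
        saliency = teacher(video_tokens).norm(dim=-1)      # (B, N) scores
    k = int(N * (1 - align_ratio))
    top = saliency.topk(k, dim=1).indices
    vis2 = torch.gather(video_tokens, 1, top.unsqueeze(-1).expand(-1, -1, D))
    v = F.normalize(student(vis2).mean(dim=1), dim=-1)     # pooled video feature
    a = F.normalize(audio_tokens.mean(dim=1), dim=-1)      # pooled audio feature
    logits = v @ a.T / temperature
    labels = torch.arange(B, device=v.device)
    loss_align = F.cross_entropy(logits, labels)

    return loss_recon + loss_align
```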
https://arxiv.org/abs/2604.08147
We present the first systematic study of Sparse Autoencoders (SAEs) on video representations. Standard SAEs decompose video into interpretable, monosemantic features but destroy temporal coherence: hard TopK selection produces unstable feature assignments across frames, reducing autocorrelation by 36%. We propose spatio-temporal contrastive objectives and Matryoshka hierarchical grouping that recover and even exceed raw temporal coherence. The contrastive loss weight controls a tunable trade-off between reconstruction and temporal coherence. A systematic ablation on two backbones and two datasets shows that different configurations excel at different goals: reconstruction fidelity, temporal coherence, action discrimination, or interpretability. Contrastive SAE features improve action classification by +3.9% over raw features and improve text-video retrieval R@1 by up to 2.8x. A cross-backbone analysis reveals that standard monosemanticity metrics contain a backbone-alignment artifact: both DINOv2 and VideoMAE produce equally monosemantic features under neutral (CLIP) similarity. Causal ablation confirms that contrastive training concentrates predictive signal into a small number of identifiable features.
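A minimal TopK SAE with an added temporal contrastive term, to show the two objectives being traded off; the wiring and hyperparameters are assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

class TopKSAE(torch.nn.Module):
    """Minimal TopK sparse autoencoder with a temporal contrastive term
    (illustrative sketch; not the paper's exact configuration)."""
    def __init__(self, d_model, d_dict, k):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_dict)
        self.dec = torch.nn.Linear(d_dict, d_model, bias=False)
        self.k = k

    def encode(self, x):
        z = F.relu(self.enc(x))
        # Hard TopK: keep the k largest activations per token, zero the rest.
        vals, idx = z.topk(self.k, dim=-1)
        return torch.zeros_like(z).scatter(-1, idx, vals)

    def forward(self, frames):                     # frames: (B, T, d_model)
        z = self.encode(frames)
        loss_recon = F.mse_loss(self.dec(z), frames)
        # Temporal contrastive term: codes of adjacent frames should stay
        # close relative to codes from other clips/frames in the batch.
        anchor = F.normalize(z[:, :-1].flatten(0, 1), dim=-1)
        pos = F.normalize(z[:, 1:].flatten(0, 1), dim=-1)
        logits = anchor @ pos.T / 0.1
        labels = torch.arange(anchor.size(0), device=frames.device)
        loss_temporal = F.cross_entropy(logits, labels)
        return loss_recon, loss_temporal
```

A weighted sum `loss_recon + w * loss_temporal` then exposes the reconstruction-vs-coherence trade-off the abstract describes, with `w` playing the role of the contrastive loss weight.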
https://arxiv.org/abs/2604.03919
Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are initialized from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception. The registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is released at this https URL.
https://arxiv.org/abs/2604.03653
Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias - they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real-world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle-driven Video-RAG pipeline unlocks video semantics through subtitle analysis, performing progressive three-stage retrieval - domain classification, topic extraction, and relevance matching - to identify task-relevant tutorial videos. Second, a fully automated annotation pipeline built on an inverse dynamics paradigm feeds consecutive keyframes, enhanced with UI element detection, into VLMs, inferring the required planning and grounding knowledge, which is then injected into the agent's corresponding modules to address both manifestations of domain bias. Extensive experiments on OSWorld demonstrate GUIDE's generality as a plug-and-play component for both multi-agent systems and single-model agents. It consistently yields over 5% improvements and reduces execution steps - without modifying any model parameters or architecture - validating GUIDE as an architecture-agnostic enhancement to bridge GUI agent domain bias.
https://arxiv.org/abs/2603.26266
Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods model only interactions between the text and frames/video and ignore the rich interactions among frames within a video, so the expanded text fails to capture frame-level context, creating disparities between text and video. In response, we introduce the Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate, context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph from the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments demonstrate the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX. Code is available at this https URL.
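The sigmoid replacement for the softmax-based contrastive loss can be sketched SigLIP-style: each text-video pair becomes an independent binary decision, avoiding batch-wide softmax normalization. The temperature `t` and bias `b` below are illustrative defaults, not EagleNet's values.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(text_emb, video_emb, t=10.0, b=-10.0):
    """SigLIP-style sigmoid loss over all pairwise text-video logits."""
    x = F.normalize(text_emb, dim=-1)
    y = F.normalize(video_emb, dim=-1)
    logits = x @ y.T * t + b                        # (B, B) pairwise logits
    # +1 on the diagonal (matched pairs), -1 elsewhere (in-batch negatives).
    labels = 2 * torch.eye(x.size(0), device=x.device) - 1
    return -F.logsigmoid(labels * logits).mean()
```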
https://arxiv.org/abs/2603.25267
Current large-scale video datasets focus on general human activity but lack depth of coverage on the fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations, paired with instructional narrations explaining the know-how behind the actions, across 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x over the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., "here's my execution of a skill; which video clip should I watch to improve it?"). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.
https://arxiv.org/abs/2603.25163
Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.
https://arxiv.org/abs/2603.23902
Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on Recall@K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances the interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at this https URL.
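A hedged sketch of the two-step, zero-shot flow: ask a multimodal LLM for the edit's after-effects, then rank candidates against the reasoned query. `mllm_generate` and `embed` are hypothetical stand-ins, not the paper's API.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def covr_zero_shot(reference_video, edit_text, candidates, mllm_generate, embed):
    # (i) Ask the MLLM to spell out the causal/temporal after-effects of the edit.
    prompt = (f"Given the reference video and the edit '{edit_text}', describe "
              "the resulting clip, including motion, state transitions, "
              "viewpoint, and duration consequences.")
    reasoned_query = mllm_generate(reference_video, prompt)

    # (ii) Rank candidates by similarity to the reasoned query; no finetuning.
    q = embed(reasoned_query)
    scored = [(cosine(q, embed(v)), v) for v in candidates]
    return max(scored, key=lambda s: s[0])[1]
```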
https://arxiv.org/abs/2603.20190
Existing text-to-video retrieval benchmarks are dominated by real-world footage where much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines, and observe consistent and interpretable failure patterns: both frequently confuse the main video with the temporal hard negative and over-prefer temporally plausible but end-state-incorrect clips, indicating insufficient grounding to decisive end-state evidence, while being comparatively less sensitive to semantic substitutions. We further introduce triplet-based diagnostic analyses, including relative-order statistics and breakdowns across transition categories, to make temporal vs. semantic failure sources explicit. GenState-AI provides a focused testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, and will be released on this http URL.
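The triplet-based relative-order diagnostic can be computed directly from query-video similarities, e.g. how often each distractor outranks the main video; the metric names below are ours, not necessarily the benchmark's.

```python
import numpy as np

def triplet_order_stats(sim_main, sim_temporal_neg, sim_semantic_neg):
    """Relative-order diagnostic over (query, main, temporal-neg, semantic-neg)
    triplets. Inputs are arrays of query-video similarities, one per triplet;
    the outputs are the rates at which each distractor ties or beats the main
    video, separating temporal from semantic confusion."""
    sim_main = np.asarray(sim_main)
    temporal_confusion = np.mean(np.asarray(sim_temporal_neg) >= sim_main)
    semantic_confusion = np.mean(np.asarray(sim_semantic_neg) >= sim_main)
    return {"temporal_confusion": float(temporal_confusion),
            "semantic_confusion": float(semantic_confusion)}
```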
https://arxiv.org/abs/2603.14426
Open-domain video platforms offer rich, personalized content that could support health, caregiving, and educational applications, but their engagement-optimized recommendation algorithms can expose vulnerable users to inappropriate or harmful material. These risks are especially acute in child-directed and care settings (e.g., dementia care), where content must satisfy individualized safety constraints before being shown. We introduce SafeScreen, a safety-first video screening framework that retrieves and presents personalized video while enforcing individualized safety constraints. Rather than ranking videos by relevance or popularity, SafeScreen treats safety as a prerequisite and performs sequential approval or rejection of candidate videos through an automated pipeline. SafeScreen integrates three key components: (i) profile-driven extraction of individualized safety criteria, (ii) evidence-grounded assessments via adaptive question generation and multimodal VideoRAG analysis, and (iii) LLM-based decision-making that verifies safety, appropriateness, and relevance before content exposure. This design enables explainable, real-time screening of uncurated video repositories without relying on precomputed safety labels. We evaluate SafeScreen in a dementia-care reminiscence case study using 30 synthetic patient profiles and 90 test queries. Results demonstrate that SafeScreen prioritizes safety over engagement, diverging from YouTube's engagement-optimized rankings in 80-93% of cases, while maintaining high levels of safety coverage, sensibleness, and groundedness, as validated by both LLM-based evaluation and domain experts.
https://arxiv.org/abs/2604.03264
As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
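A minimal sketch of the state-conditioned residual idea: encode the visual history and add a predicted delta to the frozen query embedding. The GRU history encoder and the layer sizes are assumptions, not CAST's actual architecture.

```python
import torch

class CASTAdapter(torch.nn.Module):
    """Lightweight adapter over a frozen embedding space: predict a
    state-conditioned residual update (delta) from the visual history."""
    def __init__(self, d_emb, d_state=256):
        super().__init__()
        self.history = torch.nn.GRU(d_emb, d_state, batch_first=True)
        self.to_delta = torch.nn.Linear(d_state, d_emb)

    def forward(self, query_emb, history_embs):
        # history_embs: (B, T, d_emb) frozen embeddings of the clips so far.
        _, h = self.history(history_embs)       # h: (1, B, d_state) latent state
        delta = self.to_delta(h.squeeze(0))     # state-conditioned residual
        return query_emb + delta                # updated retrieval query
```

Because the adapter only adds a residual to the original embedding, it stays plug-and-play across different frozen vision-language backbones, matching the design goal stated above.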
https://arxiv.org/abs/2603.08648
We introduce VDCook: a self-evolving video data operating system and configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically performs query optimization, concurrently running real video retrieval and controlled synthesis modules. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible Notebooks. Unlike traditional static, one-time-built datasets, VDCook enables continuous updates and domain expansion through its automated data ingestion mechanism based on MCP (Model Context Protocol) [mcp2024anthropic], transforming datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible subsequent data 'cooking' and indexing [vlogger]. This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. Project demo: this https URL
https://arxiv.org/abs/2603.05539
The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.
https://arxiv.org/abs/2603.02888
Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations. OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and a MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio, and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks: composed audio retrieval and audio-visual retrieval, to more comprehensively evaluate a model's omni-modal embedding capacity.
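One way to render the sliced-Wasserstein pooling idea: project the token set onto random directions and keep the sorted projections, so the pooled vector preserves the token distribution rather than collapsing it to a single mean. This is our illustrative sketch of plain sliced-Wasserstein pooling; the paper's variant additionally incorporates attention.

```python
import torch

def sliced_wasserstein_pool(tokens, n_proj=64, generator=None):
    """Pool a variable set of tokens into a fixed vector of sorted 1-D
    marginals along random unit directions. tokens: (B, N, D)."""
    B, N, D = tokens.shape
    theta = torch.randn(D, n_proj, generator=generator, device=tokens.device)
    theta = theta / theta.norm(dim=0, keepdim=True)   # unit projection directions
    proj = tokens @ theta                             # (B, N, n_proj)
    pooled, _ = proj.sort(dim=1)                      # sorted projections per slice
    return pooled.flatten(1)                          # (B, N * n_proj)
```

Comparing two pooled vectors elementwise then approximates a sliced Wasserstein distance between the underlying token distributions, which is why this pooling retains finer-grained detail than mean pooling.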
https://arxiv.org/abs/2603.02098
We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al., 2024), operating in SONAR and trained with English text only, can perform both single- and multi-visual-concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent-diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them on 61 of the 62 tested languages, spanning rich- to low-resource languages.
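The post-hoc alignment can be pictured as a small trainable projector from frozen vision features into the frozen SONAR space, regressed onto SONAR embeddings of paired captions. The MLP shape and the MSE-plus-cosine objective below are assumptions, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

class SonarProjector(torch.nn.Module):
    """Post-hoc alignment head: map frozen vision-encoder features into the
    (frozen) SONAR text embedding space."""
    def __init__(self, d_vision, d_sonar, d_hidden=2048):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d_vision, d_hidden), torch.nn.GELU(),
            torch.nn.Linear(d_hidden, d_sonar))

    def forward(self, vision_feats):
        return self.net(vision_feats)

def alignment_loss(pred, sonar_caption_emb):
    # Regress projected visual features onto SONAR embeddings of paired captions.
    return (F.mse_loss(pred, sonar_caption_emb)
            + (1 - F.cosine_similarity(pred, sonar_caption_emb, dim=-1)).mean())
```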
https://arxiv.org/abs/2603.01096
Query performance prediction (QPP) is an important and actively studied information retrieval task, with applications such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been studied primarily in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation, and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before the retrieval step is performed. We also demonstrate the applicability of VQPP by employing the best-performing pre-retrieval predictor as a reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at this https URL.
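As one example of a pre-retrieval predictor, a classic signal is the mean IDF of the query terms, computable before any retrieval is run. The benchmark evaluates several pre- and post-retrieval predictors; this is just a standard illustrative one.

```python
import math

def avg_idf_predictor(query, doc_freq, n_docs):
    """Mean-IDF pre-retrieval QPP signal: higher values suggest a more
    discriminative (likely easier) query. `doc_freq` maps a term to the
    number of indexed captions/videos containing it."""
    terms = query.lower().split()
    idfs = [math.log((n_docs + 1) / (doc_freq.get(t, 0) + 1)) for t in terms]
    return sum(idfs) / max(len(idfs), 1)
```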
https://arxiv.org/abs/2602.17814
Large-scale video repositories are increasingly available for modern video understanding and generation tasks. However, transforming raw videos into high-quality, task-specific datasets remains costly and inefficient. We present DataCube, an intelligent platform for automatic video processing, multi-dimensional profiling, and query-driven retrieval. DataCube constructs structured semantic representations of video clips and supports hybrid retrieval with neural re-ranking and deep semantic matching. Through an interactive web interface, users can efficiently construct customized video subsets from massive repositories for training, analysis, and evaluation, and build searchable systems over their own private video collections. The system is publicly accessible at this https URL. Demo Video: this https URL
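A hedged skeleton of hybrid retrieval with neural re-ranking: fuse a lexical and a dense score to build a candidate list, then apply deep semantic matching only to the short list. The three scoring callables are hypothetical stand-ins, not DataCube's API.

```python
def hybrid_search(query, clips, lexical_score, dense_score, cross_encoder,
                  alpha=0.5, rerank_k=50):
    """Two-stage hybrid retrieval: cheap fused scoring over all clips,
    then expensive cross-encoder re-ranking over the top candidates."""
    fused = sorted(
        clips,
        key=lambda c: alpha * dense_score(query, c)
                    + (1 - alpha) * lexical_score(query, c),
        reverse=True)
    head = fused[:rerank_k]
    # Deep semantic matching only on the short list keeps re-ranking cheap.
    return sorted(head, key=lambda c: cross_encoder(query, c), reverse=True)
```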
https://arxiv.org/abs/2602.16231
Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human "Recall-Search-Verify" cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
https://arxiv.org/abs/2602.10159
Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms current methods, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.
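The intermediate-layer probe can be sketched as pooling hidden states from a mid-depth layer rather than the last one. This assumes a HuggingFace-style model that exposes `hidden_states`; the layer index is an illustrative choice, not the paper's.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def intermediate_layer_embedding(mllm, inputs, layer=-8):
    """Pool hidden states from an intermediate layer instead of the final one.
    `mllm` is any model that returns per-layer hidden states when asked."""
    out = mllm(**inputs, output_hidden_states=True)
    h = out.hidden_states[layer]          # (B, T, D) at the chosen depth
    emb = h.mean(dim=1)                   # mean-pool over the token axis
    return F.normalize(emb, dim=-1)
```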
https://arxiv.org/abs/2602.08099