With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
https://arxiv.org/abs/2602.11124
Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: this https URL.
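The abstract does not spell out VideoSTF's three n-gram metrics, but the general idea of an n-gram-based repetition score can be sketched as follows: the fraction of n-grams in a generated transcript that duplicate an earlier n-gram (an illustrative metric only, not necessarily one of the paper's three):

```python
from collections import Counter

def ngram_repetition(text: str, n: int = 3) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram.

    0.0 means every n-gram is unique; values near 1.0 indicate the
    degenerate looping output described in the abstract.
    """
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c - 1 for c in counts.values())  # copies beyond the first
    return repeated / len(grams)
```

A healthy caption scores near 0, while a self-reinforcing loop drives the score toward 1, which is what makes such a metric usable both for pervasive testing and as a stress-test objective.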
https://arxiv.org/abs/2602.10639
Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both the textual caption and the chart image for claim verification. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification works focus mainly on reasoning over textual evidence only, or ignore explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph for claims and evidence, where we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce a multi-modal Fusion-in-Decoder for explainability. Finally, since almost all existing datasets are in the general domain, we create a scientific dataset, AIChartClaim, in the AI domain to complement the claim verification community. Experiments demonstrate the strength of our model.
https://arxiv.org/abs/2602.10023
Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial to safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN employing a Fibonacci dilation schedule to better capture short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and temporal regularizers (truncated MSE and a transition-aware term) to reduce over-segmentation and to encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and our RoboSubtask benchmark (boundary-sensitive and sequence metrics), while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1 @ 50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1 @ 50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1 @ 50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success approximately 91.25%). These results demonstrate a practical path from sub-task level video understanding to deployed robotic manipulation in real-world settings.
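The Fibonacci dilation schedule contrasts with MS-TCN's usual exponential 2**i dilations: early layers dilate more slowly, keeping short-horizon transitions inside a dense receptive field. A minimal sketch (assuming the schedule starts at 1, 1 — the abstract does not specify the starting terms):

```python
def fibonacci_dilations(num_layers: int) -> list:
    """Dilation rate per temporal-conv layer following the Fibonacci
    sequence (1, 1, 2, 3, 5, ...), in contrast to MS-TCN's 2**i
    schedule. Smaller early dilations favor short transitions such
    as reach-pick-place."""
    rates = []
    a, b = 1, 1
    for _ in range(num_layers):
        rates.append(a)
        a, b = b, a + b
    return rates

def receptive_field(dilations: list, kernel_size: int = 3) -> int:
    """Receptive field (in frames) of a stack of dilated 1-D convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

For six layers the schedule is [1, 1, 2, 3, 5, 8] with a 41-frame receptive field, versus [1, 2, 4, 8, 16, 32] and 127 frames for the exponential schedule, illustrating the trade toward finer temporal granularity.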
https://arxiv.org/abs/2602.10015
Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0\%$ of samples, while South American and African countries are severely under-represented with only $1.8\%$ and $3.8\%$ of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data ($\rho = 0.82$). Examining non-English subsets for $4$ languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
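The reported correlation ($\rho = 0.82$) between GDP and data representation is Spearman's rank correlation. A self-contained sketch of how such a coefficient is computed (toy numbers below are illustrative, not the paper's data):

```python
def rank(values):
    """1-based average ranks; ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Because only ranks enter the statistic, it captures the monotone relationship between a country's GDP and its sample share without assuming linearity.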
https://arxiv.org/abs/2602.09775
Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates the generation into: 1) Controllable Skeleton Generation, where we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent space, in which the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE's decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, where we then synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig: a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. Together, our framework enables a more intuitive workflow for creating ready-to-animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.
https://arxiv.org/abs/2602.09713
The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
https://arxiv.org/abs/2602.09637
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgments. With task-specific prompts ranging from best-candidate selection, summarization, and image captioning to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
https://arxiv.org/abs/2602.09624
The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding, and an inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into the perspectives of Perception, Modification, and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthesis methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.
https://arxiv.org/abs/2602.09587
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
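The core of SemanticMoments — temporal statistics over per-frame semantic features — can be sketched directly. The function below pools a T x D sequence of frame features into one descriptor of per-dimension mean, standard deviation, skewness, and kurtosis; the paper's exact choice of moments and normalization is not given in the abstract, so treat this as an illustrative instance:

```python
def semantic_moments(frame_feats, orders=(1, 2, 3, 4)):
    """Pool a (T x D) list of per-frame feature vectors into a single
    video descriptor via per-dimension standardized moments:
    mean, std, skewness, kurtosis. Training-free, like the method in
    the abstract; no learned parameters are involved."""
    T, D = len(frame_feats), len(frame_feats[0])
    desc = []
    for d in range(D):
        xs = [f[d] for f in frame_feats]
        mu = sum(xs) / T
        var = sum((x - mu) ** 2 for x in xs) / T
        sd = var ** 0.5
        for k in orders:
            if k == 1:
                desc.append(mu)
            elif k == 2:
                desc.append(sd)
            else:  # standardized k-th moment; 0 for constant dimensions
                desc.append(sum(((x - mu) / sd) ** k for x in xs) / T
                            if sd > 0 else 0.0)
    return desc
```

Because the mean alone is invariant to frame order while higher moments respond to how feature activations fluctuate over time, descriptors like this separate motion dynamics from static appearance, which is exactly the disentanglement the benchmarks test.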
https://arxiv.org/abs/2602.09146
Videos from fleets of ground robots can advance public safety by providing scalable situational awareness and reducing professionals' burden. Yet little is known about how to design and integrate multi-robot videos into public safety workflows. Collaborating with six police agencies, we examined how such videos could be made practical. In Study 1, we presented the first testbed for multi-robot ground video sensemaking. The testbed includes 38 events-of-interest (EoI) relevant to public safety, a dataset of 20 robot patrol videos (10 day/night pairs) covering EoI types, and 6 design requirements aimed at improving current video sensemaking practices. In Study 2, we built MRVS, a tool that augments multi-robot patrol video streams with a prompt-engineered video understanding model. Participants reported reduced manual workload and greater confidence with LLM-based explanations, while noting concerns about false alarms and privacy. We conclude with implications for designing future multi-robot video sensemaking tools. The testbed is available at this https URL\_VideoSensemaking
https://arxiv.org/abs/2602.08882
With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key frames based on user input, which is processed by an LLM to generate a CLIP-style prompt. Pre-trained CLIP encoders calculate the semantic similarity between the prompt and each frame, selecting the most relevant frames as key frames. To preserve video semantics, TiFRe employs a Frame Matching and Merging (FMM) mechanism, which integrates non-key frame information into the selected key frames, minimizing information loss. Experiments show that TiFRe effectively reduces computational costs while improving performance on video-language tasks.
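The TFS step reduces to a similarity ranking: score every frame embedding against the prompt embedding and keep the top-k in temporal order. A minimal sketch, where the toy vectors stand in for real CLIP encoder outputs:

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def select_key_frames(prompt_emb, frame_embs, k):
    """Rank frames by CLIP-style similarity to the prompt embedding
    and keep the top-k, returned in temporal order (indices)."""
    ranked = sorted(range(len(frame_embs)),
                    key=lambda i: cosine(prompt_emb, frame_embs[i]),
                    reverse=True)
    return sorted(ranked[:k])
```

The FMM step would then merge each unselected frame's information into its nearest selected key frame; that part depends on the model's token representation, so it is omitted here.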
https://arxiv.org/abs/2602.08861
We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated, high-quality training data, supporting high-quality text-to-video generation and various video editing tasks such as object removal, addition, background change, complex motion editing, etc. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.
https://arxiv.org/abs/2602.08820
This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at this https URL.
https://arxiv.org/abs/2602.08711
Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have strayed from these truths: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., codecs. Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics. Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, it consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.
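The "focus on high-entropy regions" idea can be made concrete with a toy entropy-based patch selector. The abstract does not specify OV-Encoder's patch extraction or entropy measure, so the histogram-entropy proxy below is purely illustrative:

```python
import math

def patch_entropy(patch):
    """Shannon entropy (bits) of the pixel-intensity histogram of one patch."""
    counts = {}
    for v in patch:
        counts[v] = counts.get(v, 0) + 1
    n = len(patch)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def select_patches(patches, keep_frac=0.25):
    """Keep only the top `keep_frac` highest-entropy patches, mimicking
    codec-style sparsity: flat background patches are dropped while
    information-rich patches survive. Returns kept indices in order."""
    k = max(1, int(len(patches) * keep_frac))
    ranked = sorted(range(len(patches)),
                    key=lambda i: patch_entropy(patches[i]),
                    reverse=True)
    return sorted(ranked[:k])
```

Uniform patches (entropy 0) are exactly the static background the hypothesis says is wasted compute; in a real codec-aligned encoder the surviving fraction would match the 3.1%-25% sparsity budget cited above.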
https://arxiv.org/abs/2602.08683
Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks the Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities compared to the T2V foundation models. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark to perform a comprehensive model test and comparison. After continued pretraining and finetuning on millions of high-quality samples, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: this https URL.
https://arxiv.org/abs/2602.08682
Streaming video question answering (Streaming Video QA) poses distinct challenges for multimodal large language models (MLLMs), as video frames arrive sequentially and user queries can be issued at arbitrary time points. Existing solutions relying on fixed-size memory or naive compression often suffer from context loss or memory overflow, limiting their effectiveness in long-form, real-time scenarios. We present Vista, a novel framework for scene-aware streaming video QA that enables efficient and scalable reasoning over continuous video streams. The innovation of Vista can be summarized in three aspects: (1) scene-aware segmentation, where Vista dynamically clusters incoming frames into temporally and visually coherent scene units; (2) scene-aware compression, where each scene is compressed into a compact token representation and stored in GPU memory for efficient index-based retrieval, while full-resolution frames are offloaded to CPU memory; and (3) scene-aware recall, where relevant scenes are selectively recalled and reintegrated into the model input upon receiving a query, enabling both efficiency and completeness. Vista is model-agnostic and integrates seamlessly with a variety of vision-language backbones, enabling long-context reasoning without compromising latency or memory efficiency. Extensive experiments on StreamingBench demonstrate that Vista achieves state-of-the-art performance, establishing a strong baseline for real-world streaming video understanding.
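Step (1), scene-aware segmentation, can be sketched as greedy streaming clustering: a frame joins the current scene while its similarity to the scene's running mean embedding stays high, and otherwise opens a new scene. The threshold rule below is an assumption for illustration; the paper's actual clustering criterion is not given in the abstract:

```python
def segment_scenes(frame_embs, threshold=0.8):
    """Greedy streaming segmentation into temporally coherent scenes.
    Returns a list of (start, end) frame-index pairs, inclusive."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / ((sum(a * a for a in u) ** 0.5)
                      * (sum(b * b for b in v) ** 0.5))

    scenes = []
    start, centroid, count = 0, None, 0
    for i, emb in enumerate(frame_embs):
        if centroid is None:
            centroid, count = list(emb), 1
        elif cosine(centroid, emb) >= threshold:
            count += 1  # running-mean update of the scene centroid
            centroid = [c + (e - c) / count for c, e in zip(centroid, emb)]
        else:
            scenes.append((start, i - 1))
            start, centroid, count = i, list(emb), 1
    if centroid is not None:
        scenes.append((start, len(frame_embs) - 1))
    return scenes
```

Each resulting scene would then be compressed into a compact token representation held in GPU memory (step 2), with full-resolution frames offloaded to CPU memory until a query triggers recall (step 3).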
https://arxiv.org/abs/2602.08448
Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts given few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarized video subtitles as text demonstrations; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.
https://arxiv.org/abs/2602.08439
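Independently of the model, the task format itself can be made concrete with a small prompt-assembly sketch. The demonstration schema below (`type`, `summary`, `video_id` fields) is hypothetical, not the benchmark's actual data layout; it only illustrates how summarized subtitles (text demonstrations) and referenced instructional videos (video demonstrations) precede the question about the target video.

```python
def build_demo_icl_prompt(demos, target_question):
    """Assemble a demo-driven in-context prompt (illustrative schema only).

    demos: list of dicts, each either
        {"type": "text",  "summary": "..."}     - summarized video subtitles
        {"type": "video", "video_id": "..."}    - reference to a demo video
    In a real MLLM, the video demonstration's frames would be fed through the
    vision tower; here a placeholder tag stands in for them.
    """
    parts = []
    for i, demo in enumerate(demos, 1):
        if demo["type"] == "text":
            # (i) summarized video subtitles serve as a text demonstration
            parts.append(f"[Demo {i} | subtitle summary]\n{demo['summary']}")
        else:
            # (ii) the corresponding instructional video as a video demonstration
            parts.append(f"[Demo {i} | video]\n<video:{demo['video_id']}>")
    parts.append(f"[Target video question]\n{target_question}")
    return "\n\n".join(parts)
```

The point of the format is that the answer must be grounded in the demonstrations, not in the model's static internal knowledge, which is exactly the gap the benchmark probes.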
E-commerce short videos represent a high-revenue segment of the online video industry, characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect reasoning about commercial intent. In this work, we first propose a \textbf{multi-modal information density assessment framework} to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities than mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce the \textbf{E-commerce Video Ads Benchmark (E-VAds)}, the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, "Perception" and "Cognition and Reasoning", which together comprise five distinct tasks. Finally, we develop \textbf{E-VAds-R1}, an RL-based reasoning model featuring a multi-grained reward design called \textbf{MG-GRPO}. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.
https://arxiv.org/abs/2602.08355
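The abstract's description of MG-GRPO, smooth guidance for early exploration plus a non-linear incentive for expert-level precision, suggests a piecewise reward shape. A minimal sketch under that reading (the knee position, exponent, and overall functional form are assumptions; the paper's actual reward is not given in the abstract):

```python
def multi_grained_reward(score, knee=0.8, gamma=2.0):
    """Illustrative multi-grained reward shaping (assumed form, not the paper's).

    score: scalar answer-quality score in [0, 1].
    knee:  assumed threshold where "expert-level" credit kicks in.
    gamma: assumed exponent (> 1) controlling how sharply the bonus grows.
    """
    if score <= knee:
        # Smooth linear credit guides the policy while it is still exploring:
        # every marginal improvement earns the same marginal reward.
        return score
    # Convex bonus above the knee: nearly flat just past it, steep near a
    # perfect score, creating a non-linear incentive for precision.
    bonus = ((score - knee) / (1.0 - knee)) ** gamma
    return score + bonus
```

Under this shape, the same 0.1 improvement in raw score is worth far more near 1.0 than near the knee, which is one plausible way to reconcile "smooth early guidance" with "non-linear expert-level incentives" in a GRPO-style advantage computation.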
Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. Finally, we introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban understanding tasks -- including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built upon the Qwen2.5-VL-7B backbone achieves up to a 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and gains of over 30% and 22% respectively on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.
https://arxiv.org/abs/2602.08342
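The kind of graph-aligned supervision UGData describes, spatial reasoning paths that expose connectivity, distance, and directionality beyond image content, can be sketched with a toy spatial graph. The adjacency and coordinate formats below are illustrative assumptions, not the dataset's actual schema:

```python
import math
from collections import deque

def bearing_to_direction(dx, dy):
    """Map a displacement vector to a coarse compass direction.
    Assumption: y increases northwards, x increases eastwards."""
    angle = math.degrees(math.atan2(dx, dy)) % 360  # bearing from north
    dirs = ["north", "northeast", "east", "southeast",
            "south", "southwest", "west", "northwest"]
    return dirs[int((angle + 22.5) // 45) % 8]

def spatial_reasoning_caption(graph, coords, src, dst):
    """Derive a spatial-reasoning-path caption between two street-view anchor
    nodes of a spatial graph (UGData-style supervision, illustrative only).

    graph:  adjacency dict, node -> list of connected nodes
    coords: node -> (x, y) position, assumed to be in meters
    """
    # Connectivity: BFS over the spatial graph for a connecting path.
    prev, seen = {src: None}, {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                prev[nb] = node
                queue.append(nb)
    if dst not in prev:
        return f"{src} and {dst} are not connected."
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    # Distance along the path and overall heading from the coordinates.
    dist = sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))
    (x0, y0), (x1, y1) = coords[src], coords[dst]
    heading = bearing_to_direction(x1 - x0, y1 - y0)
    hops = " -> ".join(path)
    return f"From {src}, head {heading} for about {dist:.0f} m via {hops}."
```

Captions of this shape carry exactly the signals the dataset is said to expose (distance, directionality, connectivity) and none of them are recoverable from the pixels of a single street-view image, which is why graph-aligned supervision adds something contrastive image-text pairs alone cannot.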