We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
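Since PE's central claim is that contrastive vision-language training alone yields general-purpose embeddings, a minimal sketch of the symmetric image-text contrastive (CLIP-style InfoNCE) objective is given below. The batch size, embedding width, and temperature are illustrative assumptions, not PE's actual configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_vl_loss(image_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) cosine-similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)   # and each caption to its image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random features standing in for encoder outputs.
loss = contrastive_vl_loss(torch.randn(8, 512), torch.randn(8, 512))
```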
https://arxiv.org/abs/2504.13181
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models.
https://arxiv.org/abs/2504.13180
Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still suffer from insufficient semantic information in both the identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash semantic understanding capabilities in the generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
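As a loose illustration of the generative-retrieval paradigm described above, where every target is assigned an identifier and decoding is constrained to the identifier set, the sketch below builds a prefix trie over word-level identifiers. The identifiers are invented for the example and are not SemCORE's structured SIDs.

```python
from collections import defaultdict

class IdentifierTrie:
    """Prefix trie used to restrict generation to valid target identifiers."""
    def __init__(self):
        self.children = defaultdict(IdentifierTrie)
        self.is_end = False

    def insert(self, tokens):
        node = self
        for t in tokens:
            node = node.children[t]
        node.is_end = True

    def allowed_next(self, prefix):
        """Tokens that keep a partial generation inside the identifier set."""
        node = self
        for t in prefix:
            if t not in node.children:
                return set()
            node = node.children[t]
        return set(node.children.keys())

# Hypothetical word-level identifiers for three image targets.
trie = IdentifierTrie()
for sid in [["red", "bus", "on", "street"],
            ["red", "apple", "on", "table"],
            ["black", "cat", "indoors"]]:
    trie.insert(sid)

print(trie.allowed_next(["red"]))  # {'bus', 'apple'}
```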
https://arxiv.org/abs/2504.13172
In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models. However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs. As model and data scales grow exponentially, the availability of such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) delivering superior performance gains when integrated into vision-language models, as confirmed by empirical validation. This paper presents three key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology yields remarkable results in reducing hallucinations. Specifically, the non-hallucination caption rate on a held-out test set increases from 48.2% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts. Across 35 vision-language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to alt-text pairs and other previous work. Meanwhile, it also offers considerable support in the text-to-image domain. With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and 13.3 on the MSCOCO validation benchmark. 3) We will release Hunyuan-Recap100M, a low-hallucination and knowledge-intensive synthetic caption dataset.
https://arxiv.org/abs/2504.13123
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at this https URL.
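For reference, the sketch below shows the vanilla DPO objective that hierarchical preference frameworks of this kind build on: the policy is pushed to prefer the chosen response over the rejected one relative to a frozen reference model. The sequence log-probabilities and beta are toy values; VistaDPO's instance/temporal/perceptive decomposition is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard Direct Preference Optimization loss over a batch of pairs."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up sequence-level log-probabilities for four pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0, -10.2]),
                torch.tensor([-14.0, -9.0, -13.5, -12.8]),
                torch.tensor([-12.5, -9.8, -11.2, -10.5]),
                torch.tensor([-13.0, -9.6, -12.9, -12.0]))
```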
https://arxiv.org/abs/2504.13122
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided by human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at this https URL.
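As a rough illustration of the scheduling constraint mentioned above (per-frame noise levels that never decrease along the temporal axis), a minimal sketch follows. The uniform sampling and sort rule are assumptions for illustration, not SkyReels-V2's actual schedule.

```python
import torch

def non_decreasing_noise_levels(num_frames: int) -> torch.Tensor:
    """Draw per-frame noise levels in [0, 1] and order them so later frames are noisier."""
    levels = torch.rand(num_frames)
    return torch.sort(levels).values  # monotonically non-decreasing across frames

print(non_decreasing_noise_levels(8))
```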
https://arxiv.org/abs/2504.13074
In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
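One way to read the orthogonal objective described above is as a penalty on pairwise prototype similarity, so that a fixed budget of prototypes covers diverse content. The squared off-diagonal penalty below is an assumed form, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def prototype_orthogonality_loss(prototypes: torch.Tensor) -> torch.Tensor:
    """prototypes: (num_prototypes, dim). Penalize off-diagonal cosine similarity."""
    p = F.normalize(prototypes, dim=-1)
    gram = p @ p.t()                                    # pairwise cosine similarities
    off_diag = gram - torch.eye(gram.size(0), device=gram.device)
    return (off_diag ** 2).mean()

loss = prototype_orthogonality_loss(torch.randn(16, 256))  # 16 prototypes of width 256
```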
https://arxiv.org/abs/2504.13035
Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-source and closed-source models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than using the raw images directly (69.6% with raw images vs. 75.2% with captions for Gemini 1.5 Pro). Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at this https URL.
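The ensemble figure above can be reproduced in spirit with a simple majority vote over per-model choices on each FLIP instance; the sketch below assumes unweighted voting, which may differ from how the paper combines its 15 models.

```python
from collections import Counter

def ensemble_predict(votes):
    """votes: list of 'left'/'right' choices from individual models for one FLIP task."""
    return Counter(votes).most_common(1)[0][0]

print(ensemble_predict(["left", "right", "left", "left", "right"]))  # 'left'
```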
https://arxiv.org/abs/2504.12256
Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie understanding. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module (CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at this https URL.
https://arxiv.org/abs/2504.12157
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., "ride" can be depicted as "race" and "sit on", from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
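To make the conditioning route concrete, here is a minimal denoiser block that injects the subject-object pair embeddings via cross-attention, as the abstract describes; the layer sizes and block layout are illustrative assumptions, not Diff-VRD's architecture.

```python
import torch
import torch.nn as nn

class ConditionedDenoiserBlock(nn.Module):
    """Self-attention over noisy relation embeddings, cross-attention to conditions."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, relation_seq, condition):
        # relation_seq: (B, N, D) noisy relation embeddings being denoised
        # condition:    (B, M, D) visual/text embeddings of subject-object pairs
        x = relation_seq + self.self_attn(relation_seq, relation_seq, relation_seq)[0]
        x = x + self.cross_attn(x, condition, condition)[0]
        return x + self.ffn(x)

block = ConditionedDenoiserBlock()
out = block(torch.randn(2, 10, 256), torch.randn(2, 6, 256))  # (2, 10, 256)
```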
https://arxiv.org/abs/2504.12100
Despite recent advances in Large Video Language Models (LVLMs), they still struggle with fine-grained temporal understanding, hallucinate, and often make basic mistakes on even simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.
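Of the two ingredients RRPO names, the token-wise KL regularizer is the easier one to sketch: at each response token, penalize the divergence between the policy and a frozen reference model. The masking convention and reduction below are assumptions, and the sub-sequence-level refined rewards are not reproduced here.

```python
import torch
import torch.nn.functional as F

def tokenwise_kl(policy_logits, ref_logits, response_mask):
    """policy_logits, ref_logits: (B, T, V); response_mask: (B, T), 1 on response tokens."""
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_q = F.log_softmax(ref_logits, dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(-1)          # KL(policy || reference) per token
    return (kl * response_mask).sum() / response_mask.sum().clamp(min=1)

kl = tokenwise_kl(torch.randn(2, 5, 100), torch.randn(2, 5, 100), torch.ones(2, 5))
```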
https://arxiv.org/abs/2504.12083
The automatic generation of radiology reports has emerged as a promising solution to reduce a time-consuming task and accurately capture critical disease-relevant findings in X-ray images. Previous approaches for radiology report generation have shown impressive performance. However, there remains significant potential to improve accuracy by ensuring that retrieved reports contain disease-relevant findings similar to those in the X-ray images and by refining generated reports. In this study, we propose a Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation (DART) framework. In the first stage, we generate initial reports based on image-to-text retrieval with disease-matching, embedding both images and texts in a shared embedding space through contrastive learning. This approach ensures the retrieval of reports with similar disease-relevant findings that closely align with the input X-ray images. In the second stage, we further enhance the initial reports by introducing a self-correction module that re-aligns them with the X-ray images. Our proposed framework achieves state-of-the-art results on two widely used benchmarks, surpassing previous approaches in both report generation and clinical efficacy metrics, thereby enhancing the trustworthiness of radiology reports.
https://arxiv.org/abs/2504.11786
While image captioning has gained significant attention, the potential of captioning time-series images, prevalent in areas like finance and healthcare, remains largely untapped. Existing time-series captioning methods typically offer generic, domain-agnostic descriptions of time-series shapes and struggle to adapt to new domains without substantial retraining. To address these limitations, we introduce TADACap, a retrieval-based framework to generate domain-aware captions for time-series images, capable of adapting to new domains without retraining. Building on TADACap, we propose a novel retrieval strategy that retrieves diverse image-caption pairs from a target domain database, namely TADACap-diverse. We benchmarked TADACap-diverse against state-of-the-art methods and ablation variants. TADACap-diverse demonstrates comparable semantic accuracy while requiring significantly less annotation effort.
https://arxiv.org/abs/2504.11441
We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT, and a VLM-based reward model with scaling, thereby achieving outputs that well align with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular for text-rendering in complicated Chinese characters which is important to professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.
https://arxiv.org/abs/2504.11346
This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state-of-the-art and emerging trends in complex video segmentation. More information can be found on the workshop website: this https URL.
https://arxiv.org/abs/2504.11326
The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.
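A minimal sketch of the local-then-global idea described above: per-frame importance scores (assumed to come from an LLM reading each caption in its local context) are refined by a small Transformer attending over all caption embeddings. Dimensions, depth, and the additive combination are assumptions, not LLMVS's actual design.

```python
import torch
import torch.nn as nn

class GlobalScoreRefiner(nn.Module):
    """Refine local frame-importance scores with global attention over captions."""
    def __init__(self, dim: int = 384, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, caption_emb, local_scores):
        # caption_emb:  (B, T, D) embeddings of per-frame captions
        # local_scores: (B, T) importance predicted from each frame's local context
        ctx = self.encoder(caption_emb)            # attend over the whole video
        global_scores = self.head(ctx).squeeze(-1)
        return global_scores + local_scores        # combine local and global evidence

refiner = GlobalScoreRefiner()
scores = refiner(torch.randn(1, 20, 384), torch.randn(1, 20))  # (1, 20)
```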
https://arxiv.org/abs/2504.11199
Although fully-supervised oriented object detection has made significant progress in multimodal remote sensing image understanding, it comes at the cost of labor-intensive annotation. Recent studies have explored weakly and semi-supervised learning to alleviate this burden. However, these methods overlook the difficulties posed by dense annotations in complex remote sensing scenes. In this paper, we introduce a novel setting called sparsely annotated oriented object detection (SAOOD), which only labels partial instances, and propose a solution to address its challenges. Specifically, we focus on two key issues in the setting: (1) sparse labeling leading to overfitting on limited foreground representations, and (2) unlabeled objects (false negatives) confusing feature learning. To this end, we propose the S²Teacher, a novel method that progressively mines pseudo-labels for unlabeled objects, from easy to hard, to enhance foreground representations. Additionally, it reweights the loss of unlabeled objects to mitigate their impact during training. Extensive experiments demonstrate that S²Teacher not only significantly improves detector performance across different sparse annotation levels but also achieves near-fully-supervised performance on the DOTA dataset with only 10% annotation instances, effectively balancing detection accuracy with annotation efficiency. The code will be public.
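A minimal sketch of the easy-to-hard pseudo-label mining described above: the teacher's confidence threshold is relaxed over training so that harder unlabeled objects are admitted later, and mined boxes receive down-weighted losses. The linear schedule and confidence-as-weight rule are illustrative assumptions, not S²Teacher's exact design.

```python
import torch

def mine_pseudo_labels(teacher_scores, epoch, total_epochs,
                       start_thresh=0.9, end_thresh=0.5):
    """Return indices of mined detections plus per-box loss weights."""
    progress = epoch / max(total_epochs - 1, 1)
    thresh = start_thresh - progress * (start_thresh - end_thresh)  # relax over time
    keep = teacher_scores >= thresh
    weights = teacher_scores[keep]     # softer loss weight for less confident boxes
    return keep.nonzero(as_tuple=True)[0], weights, thresh

scores = torch.tensor([0.95, 0.72, 0.55, 0.30])
idx, w, t = mine_pseudo_labels(scores, epoch=5, total_epochs=12)
```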
https://arxiv.org/abs/2504.11111
Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intention interpretation. These methods tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such an imbalanced representation often fails to accurately capture and reflect the actual search intent of the user in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textual descriptions via a diffusion model. This step enhances the text encoder's ability to capture nuanced intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune all encoders contrastively by comparing adaptive token-fusion features with the target image. This mechanism dynamically balances visual and textual representations within the contrastive learning pipeline, optimizing the composed feature for retrieval. Extensive experiments on Fashion-IQ and CIRR datasets demonstrate that TMCIR significantly outperforms state-of-the-art methods, particularly in capturing nuanced user intent.
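One plausible reading of the adaptive token fusion step is a learned, token-level gate that mixes visual and textual evidence before contrastive matching; the sketch below assumes equal token counts and a shared width, which TMCIR may handle differently.

```python
import torch
import torch.nn as nn

class AdaptiveTokenFusion(nn.Module):
    """Per-token convex mix of image and text tokens via a learned sigmoid gate."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, image_tokens, text_tokens):
        # image_tokens, text_tokens: (B, T, D), already projected to a shared width
        g = self.gate(torch.cat([image_tokens, text_tokens], dim=-1))
        return g * image_tokens + (1 - g) * text_tokens

fused = AdaptiveTokenFusion()(torch.randn(2, 8, 512), torch.randn(2, 8, 512))
```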
https://arxiv.org/abs/2504.10995
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of both effective and efficient solutions to capture the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments of distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (in #parameters) while 6.0 points higher (SumR) than the recent GMMFormer method on TVR.
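A minimal sketch of the learnable span anchors mentioned above: each anchor's (center, width) pair is turned into a soft temporal mask that emphasizes one moment and suppresses background frames. The Gaussian mask form is an assumption; AMDNet's masked multi-moment attention and losses are not reproduced here.

```python
import torch

def span_masks(centers, widths, num_frames):
    """centers, widths: (num_moments,) in [0, 1]; returns (num_moments, num_frames)."""
    t = torch.linspace(0, 1, num_frames)                      # normalized frame positions
    dist = (t[None, :] - centers[:, None]) ** 2
    return torch.exp(-dist / (2 * widths[:, None] ** 2 + 1e-6))

masks = span_masks(torch.tensor([0.2, 0.7]), torch.tensor([0.10, 0.15]), num_frames=32)
```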
https://arxiv.org/abs/2504.10920
In autonomous driving, it is crucial to correctly interpret traffic gestures (TGs), such as those of an authority figure providing orders or instructions, or a pedestrian signaling the driver, to ensure a safe and pleasant traffic environment for all road users. This study investigates the capabilities of state-of-the-art vision-language models (VLMs) in zero-shot interpretation, focusing on their ability to caption and classify human gestures in traffic contexts. We create and publicly share two custom datasets with varying formal and informal TGs, such as 'Stop', 'Reverse', 'Hail', etc. The datasets are "Acted TG (ATG)" and "Instructive TG In-The-Wild (ITGI)". They are annotated with natural language, describing the pedestrian's body position and gesture. We evaluate models using three methods utilizing expert-generated captions as baseline and control: (1) caption similarity, (2) gesture classification, and (3) pose sequence reconstruction similarity. Results show that current VLMs struggle with gesture understanding: sentence similarity averages below 0.59, and classification F1 scores reach only 0.14-0.39, well below the expert baseline of 0.70. While pose reconstruction shows potential, it requires more data and refined metrics to be reliable. Our findings reveal that although some SOTA VLMs can interpret zero-shot human traffic gestures, none are accurate and robust enough to be trustworthy, emphasizing the need for further research in this domain.
https://arxiv.org/abs/2504.10873