Visual generation models have made remarkable progress in creating realistic images from text prompts, yet they struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and the final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state of the art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at this https URL.
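To make the reward design concrete, here is a minimal sketch of how per-dimension MLLM judgments over both the reasoning chain (stage 1) and the final image (stage 2) might be combined; the `judge` callable, the prompts, and the equal weighting are our own assumptions, not the paper's interface.

```python
# Hypothetical sketch: combine MLLM-judged scores over the reasoning
# chain (stage 1) and the rendered image (stage 2).
def got_r1_reward(judge, prompt, reasoning, image, weights=(1.0, 1.0, 1.0, 1.0)):
    """judge(question, *evidence) -> float in [0, 1], an assumed MLLM call."""
    r_plan_sem = judge("Does the plan cover every object and attribute?",
                       prompt, reasoning)
    r_plan_spa = judge("Is the planned layout spatially consistent?",
                       prompt, reasoning)
    r_img_sem = judge("Does the image match the prompt's semantics and layout?",
                      prompt, image)
    r_img_qual = judge("Rate the visual quality of the image.", image)
    scores = (r_plan_sem, r_plan_spa, r_img_sem, r_img_qual)
    return sum(w * r for w, r in zip(weights, scores))
```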
https://arxiv.org/abs/2505.17022
Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich, multi-level textual representations; (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance against 15+ MLLMs on the English image implication benchmark and a huge improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
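As a rough illustration of the three-stage loop, the sketch below wires perception, iterative search, and reasoning together; the `mllm.*` and `search` helpers are hypothetical stand-ins for the framework's actual components.

```python
# Hypothetical sketch of the Perception -> Search -> Reasoning loop.
def lad(image, mllm, search, max_rounds=3):
    context = mllm.describe(image)                   # (1) Perception
    for _ in range(max_rounds):                      # (2) Search
        queries = mllm.propose_queries(image, context)
        if not queries:                              # ambiguity resolved
            break
        context += "\n".join(search(q) for q in queries)
    return mllm.reason(image, context)               # (3) Reasoning
```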
https://arxiv.org/abs/2505.17019
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final answer. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1 as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed from a comparison of the thinking rewards of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at this https URL.
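A hedged sketch of what a trust-weighted, annealed thinking reward could look like follows; the sigmoid trust statistic and the linear annealing schedule are our assumptions based on the summary above, not the paper's exact formulation.

```python
# Hypothetical sketch of a trust-weighted, annealed thinking reward.
import torch

def trust_weight(think_reward, is_correct):
    """Trust is high when correct responses receive higher thinking
    rewards than incorrect ones (assumes the batch contains both)."""
    gap = think_reward[is_correct].mean() - think_reward[~is_correct].mean()
    return torch.sigmoid(gap)

def combined_reward(outcome_reward, think_reward, is_correct, step, total_steps):
    anneal = 1.0 - step / total_steps          # fade the thinking reward out
    w = trust_weight(think_reward, is_correct)
    return outcome_reward + anneal * w * think_reward
```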
https://arxiv.org/abs/2505.17018
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
https://arxiv.org/abs/2505.17015
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from 11 other existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
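For context, a minimal ReAct-style tool loop of the kind SpatialAgent supports might look like the following; the `parse_action` helper and the tool registry are illustrative stand-ins, not the paper's nine tools.

```python
# Illustrative ReAct-style loop; parse_action and tools are hypothetical.
def react(llm, tools, question, max_steps=8):
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trace)                  # emits "Thought/Action" or "Answer"
        if step.startswith("Answer:"):
            return step.removeprefix("Answer:").strip()
        tool_name, tool_arg = parse_action(step)   # hypothetical parser
        trace += f"{step}\nObservation: {tools[tool_name](tool_arg)}\n"
    return llm(trace + "Answer:")          # force a final answer
```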
https://arxiv.org/abs/2505.17012
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer that predicts the reconstruction quality of video frames when different numbers of tokens are used. For inference, we further propose an adaptive token allocation strategy based on integer linear programming, which adjusts token usage given the predicted scores. Such a design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments on video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
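The inference-time allocation can be phrased as a small integer linear program: each block picks one candidate token count so as to maximize predicted quality under a total budget. The sketch below is our own formulation of that idea, not AdapTok's implementation.

```python
# Our own MILP formulation of budgeted block-wise token allocation.
import numpy as np
from scipy.optimize import milp, LinearConstraint

def allocate_tokens(scores, counts, budget):
    """scores[i, j]: predicted quality of block i with counts[j] tokens.
    Returns the chosen token count per block under the total budget."""
    n_blocks, n_opts = scores.shape
    c = -scores.ravel()                          # milp minimizes; flip sign
    # Each block picks exactly one candidate count (one-hot rows).
    pick = np.kron(np.eye(n_blocks), np.ones(n_opts))
    one_hot = LinearConstraint(pick, lb=1, ub=1)
    # Total tokens across blocks must fit in the budget.
    cost = np.tile(counts, n_blocks)[None, :]
    cap = LinearConstraint(cost, lb=0, ub=budget)
    res = milp(c, constraints=[one_hot, cap], integrality=np.ones(c.size))
    choice = res.x.reshape(n_blocks, n_opts).argmax(axis=1)
    return np.asarray(counts)[choice]

scores = np.array([[0.5, 0.9],                   # block 0: 8 vs 16 tokens
                   [0.4, 0.6]])                  # block 1: 8 vs 16 tokens
print(allocate_tokens(scores, counts=[8, 16], budget=24))  # -> [16  8]
```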
https://arxiv.org/abs/2505.17011
Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
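The Bayesian view is easy to make concrete with a toy pretraining distribution (our own example, not the paper's). The Bayes-optimal sequence predictor over a two-coin mixture adapts rapidly in context, and a prompt acts purely as conditioning evidence:

```python
# Toy pretraining mixture: two coin "tasks" with different biases.
import numpy as np

tasks = np.array([0.8, 0.2])    # P(x = 1 | task)
prior = np.array([0.5, 0.5])    # mixture weights of the pretraining set

def posterior_predictive(prompt):
    """P(next = 1 | prompt) for the Bayes-optimal predictor; the prompt
    is a list of 0/1 tokens acting purely as conditioning evidence."""
    likelihood = np.array([np.prod([p if x else 1 - p for x in prompt])
                           for p in tasks])
    posterior = prior * likelihood / np.sum(prior * likelihood)
    return float(posterior @ tasks)

print(posterior_predictive([]))          # 0.5   (prior predictive)
print(posterior_predictive([1, 1, 1]))   # ~0.79 (prompt selects a task)
```

Note that no hard-token prompt can push this predictive probability above 0.8, the best task inside the mixture; a ceiling of exactly this kind illustrates the fundamental limitations of prompting that only weight tuning (or, mechanistically, soft prefixes acting outside the token alphabet) can get around.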
https://arxiv.org/abs/2505.17010
Interpreting the mineralogical aspects of rock thin sections is an important task for oil and gas reservoir evaluation. However, human analysis tends to be subjective and laborious. Technologies such as QEMSCAN(R) are designed to automate the mineralogical mapping process, but they suffer from limitations such as high monetary cost and time-consuming analysis. This work proposes a Convolutional Neural Network model for automatic mineralogical segmentation of thin section images of carbonate rocks. The model is able to mimic the QEMSCAN mapping itself in a low-cost, generalized, and efficient manner. For this, the U-Net semantic segmentation architecture is trained on plane- and cross-polarized thin section images using the corresponding QEMSCAN maps as targets, an approach not widely explored. The model was instructed to differentiate occurrences of Calcite, Dolomite, Mg-Clay Minerals, Quartz, and Pores, grouping the remaining mineral phases into a single class named "Others", and it was validated on rock facies both seen and unseen during training, in order to assess its generalization capability. Since the images and maps are provided at different resolutions, image registration was applied to align them spatially. The study reveals that the quality of the segmentation depends strongly on these resolution differences and on the variety of learnable rock textures. Nevertheless, it shows promising results, especially regarding the proper delineation of mineral boundaries on solid textures and the precise estimation of mineral distributions, describing a nearly linear relationship between expected and predicted distributions, with a coefficient of determination (R^2) above 0.97 for seen facies and 0.88 for unseen ones.
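The reported distribution agreement boils down to regressing predicted per-mineral pixel fractions against the QEMSCAN-derived ones; a minimal version of that check, with hypothetical helper names, is:

```python
# Minimal sketch of the distribution check described above.
import numpy as np

def mineral_fractions(seg_map, n_classes=6):
    """Per-class pixel fractions of an integer segmentation map."""
    return np.bincount(seg_map.ravel(), minlength=n_classes) / seg_map.size

def r_squared(expected, predicted):
    """Coefficient of determination between two fraction vectors."""
    ss_res = np.sum((expected - predicted) ** 2)
    ss_tot = np.sum((expected - expected.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```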
https://arxiv.org/abs/2505.17008
Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. Additionally, we introduce two new metrics for more robustly and affordably evaluating motion and guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
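The two proposed metrics are straightforward to state in code. The sketch below assumes flattened motion embeddings and ground-truth actions as float tensors; this is our reading of the metric definitions, not the paper's exact protocol.

```python
# Sketch of the two metrics; shapes and probing setup are assumptions.
import torch

def linear_probe_mse(motion_emb, actions):
    """(i) Fit a linear map motion -> action by least squares, report MSE."""
    X = torch.cat([motion_emb, torch.ones(len(motion_emb), 1)], dim=1)
    W = torch.linalg.lstsq(X, actions).solution
    return torch.mean((X @ W - actions) ** 2).item()

def motion_consistency(past_to_current, future_to_current):
    """(ii) Cosine similarity between the two views of current motion."""
    return torch.nn.functional.cosine_similarity(
        past_to_current, future_to_current, dim=-1).mean().item()
```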
https://arxiv.org/abs/2505.17006
We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional, discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% of observations, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at this https URL
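A minimal sketch of a gradient-based guidance step in the spirit described, pulling the denoised estimate toward the sparse observations: the denoiser interface, the observation operator, and the step size are assumptions, not the authors' implementation.

```python
# DPS-style guidance step (hedged sketch, interfaces assumed).
import torch

def guided_step(x_t, t, denoiser, observe, y, step_size=1.0):
    """One conditioning step: nudge the noisy sample so that the
    denoised estimate agrees with sparse observations y."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                # Tweedie-style estimate of x_0
    residual = observe(x0_hat) - y           # observe() masks/measures x0_hat
    loss = torch.sum(residual ** 2)
    grad, = torch.autograd.grad(loss, x_t)
    return x_t.detach() - step_size * grad   # refine before next solver step
```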
https://arxiv.org/abs/2505.17004
We study the task of learning associations between faces and voices, which has been gaining interest in the multimodal community lately. Existing methods suffer from the deliberate crafting of negative mining procedures as well as reliance on the distant margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and need to be aligned before fusing them. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
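A hedged sketch of the two ingredients named above, gated fusion and an orthogonality constraint on the fused class embeddings, follows; the layer shapes and the penalty form are our assumptions.

```python
# Sketch of gated fusion plus an orthogonality penalty (assumed forms).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, face, voice):
        g = self.gate(torch.cat([face, voice], dim=-1))
        return g * face + (1 - g) * voice      # per-dimension soft selection

def orthogonality_penalty(class_centers):
    """Push fused class centers toward mutual orthogonality."""
    C = nn.functional.normalize(class_centers, dim=-1)
    gram = C @ C.t()
    off_diag = gram - torch.eye(len(C), device=C.device)
    return (off_diag ** 2).mean()
```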
https://arxiv.org/abs/2505.17002
Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each subtask is trained via SFT and subsequently enhanced by DPO to align with psychological preferences. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.
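For reference, the pairwise preference objective underlying the DPO stages is the standard textbook form (our reference code, not the authors'):

```python
# Standard DPO objective: prefer chosen over rejected responses.
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Inputs are summed log-probs of each response under the policy
    being trained and under the frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()
```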
https://arxiv.org/abs/2505.16995
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules that yield auxiliary thoughts for augmenting conventional recommendation pipelines. However, such decoupled designs are limited by significant resource costs and suboptimal joint optimization. To address these issues, we propose \name, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of \name\ simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of \name, showing relative improvements of 68.67\% in Hit@5 and 45.21\% in NDCG@20. Code available at this https URL.
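For reference, the two reported metrics in their standard single-target next-item forms (our reference code, not the authors' evaluation script):

```python
# Standard definitions of the two reported metrics.
import math

def hit_at_k(ranked_items, target, k=5):
    """1 if the held-out item appears in the top-k list, else 0."""
    return int(target in ranked_items[:k])

def ndcg_at_k(ranked_items, target, k=20):
    """With a single relevant item, IDCG = 1, so NDCG = 1/log2(rank+1)."""
    if target in ranked_items[:k]:
        return 1.0 / math.log2(ranked_items.index(target) + 2)
    return 0.0
```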
https://arxiv.org/abs/2505.16994
Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields hierarchical segmentation that arises natively in the feature extraction process, resulting in what we coin the Native Segmentation Vision Transformer. We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is this https URL.
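A grouping layer of this kind can be sketched as cross-attention from learned group tokens to input tokens, in the GroupViT spirit; the exact mechanism in the paper may differ.

```python
# Illustrative grouping layer: learned group tokens attend over input
# tokens, producing a reduced token set plus soft assignment masks.
import torch
import torch.nn as nn

class GroupingLayer(nn.Module):
    def __init__(self, dim, n_groups):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(n_groups, dim))
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, D) input tokens
        q = self.q(self.group_tokens).expand(x.size(0), -1, -1)
        attn = torch.softmax(q @ self.k(x).transpose(1, 2)
                             / x.size(-1) ** 0.5, dim=-1)
        return attn @ self.v(x), attn             # (B, G, D) tokens + masks
```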
https://arxiv.org/abs/2505.16993
In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLMs can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple can be as low as $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique used in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLMs and enhances their inference efficiency and controllability. Code and models are available at this https URL.
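A hedged sketch of confident decoding: at every step, commit all masked positions whose prediction confidence clears a threshold, committing at least one so decoding always progresses. The threshold and the model interface are assumptions, not Dimple's actual code.

```python
# Sketch of confident decoding for a masked-token diffusion decoder.
import torch

@torch.no_grad()
def confident_decode(model, tokens, mask_id, threshold=0.9):
    """tokens: (T,) sequence containing mask_id at undecoded positions;
    model(tokens) -> (T, V) logits. Interfaces are assumed."""
    while (tokens == mask_id).any():
        probs = model(tokens).softmax(-1)
        conf, pred = probs.max(-1)
        masked = tokens == mask_id
        commit = masked & (conf >= threshold)
        if not commit.any():                       # always make progress
            best = torch.where(masked, conf, torch.tensor(-1.0)).argmax()
            commit[best] = True
        tokens = torch.where(commit, pred, tokens)
    return tokens
```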
https://arxiv.org/abs/2505.16990
Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at this https URL.
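One simple reading of multimodal feature mixing is a convex combination of in-distribution features across samples or modalities to synthesize pseudo-outliers off the ID manifold; the sketch below reflects that reading, not necessarily the paper's exact recipe.

```python
# Hedged sketch: mix ID features to synthesize pseudo-outliers.
import torch

def feature_mixing(feat_a, feat_b, alpha_range=(0.3, 0.7)):
    """feat_a, feat_b: (B, D) features from two modalities (or samples).
    Returns mixed embeddings usable as outlier supervision."""
    alpha = torch.empty(feat_a.size(0), 1).uniform_(*alpha_range)
    perm = torch.randperm(feat_b.size(0))          # decorrelate the pairing
    return alpha * feat_a + (1 - alpha) * feat_b[perm]
```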
https://arxiv.org/abs/2505.16985
Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.
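One way to read "unifying SFT and RFT in a single process" is a single update that mixes a likelihood term on reference solutions with a policy-gradient term on sampled ones; the sketch below is that reading under our own weighting assumption, not the paper's exact scheme.

```python
# Hedged sketch: one unified update mixing SFT and RFT signals.
import torch

def uft_loss(logp_ref, logp_sampled, rewards, lam=0.5):
    """logp_ref: log-probs of reference traces (supervision signal);
    logp_sampled: log-probs of on-policy samples; rewards: their returns."""
    sft = -logp_ref.mean()
    advantage = rewards - rewards.mean()
    rft = -(advantage.detach() * logp_sampled).mean()   # REINFORCE-style
    return lam * sft + (1 - lam) * rft
```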
https://arxiv.org/abs/2505.16984
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then exquisitely designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on VITON-HD, VVT and ViViD datasets demonstrate the superiority of our DPIDM against the baseline methods. Notably, DPIDM achieves VFID score of 0.506 on VVT dataset, leading to 60.5% improvement over the state-of-the-art GPD-VVTO approach.
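The temporal regularization term has a particularly compact reading, penalizing changes in pose-aware attention across consecutive frames; the tensor layout below is an assumption.

```python
# Minimal reading of a temporal attention-regularization loss.
import torch

def temporal_attention_loss(attn):
    """attn: (batch, frames, heads, queries, keys) attention maps;
    penalize frame-to-frame changes to encourage temporal consistency."""
    return ((attn[:, 1:] - attn[:, :-1]) ** 2).mean()
```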
https://arxiv.org/abs/2505.16980
Single-agent LLMs hit hard limits--finite context, role overload, and brittle domain transfer. Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tune, self-check). Grounded in the No-Free-Lunch theorem, KtR trades the chase for a universal prompt for disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators--no ever-larger monoliths required.
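The blueprint idea can be caricatured in a few lines: typed subtasks with lightweight verifiers, sequenced by a controller. The code below illustrates the pattern only; it is not the paper's actual knapsack or task-assignment blueprint.

```python
# Toy version of a controller-mediated blueprint of typed subtasks.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Subtask:
    name: str
    solve: Callable[[Any], Any]     # stand-in for an LLM agent call
    check: Callable[[Any], bool]    # lightweight "self-check" verifier

def run_blueprint(subtasks, state):
    for task in subtasks:
        out = task.solve(state)
        if not task.check(out):     # patch the bottleneck stage: one retry;
            out = task.solve(state) # real systems micro-tune or escalate
        state = out
    return state
```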
https://arxiv.org/abs/2505.16979
Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from sets of a small number of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduced a novel dataset comprising 540 structured grammar generation challenges, devised 6 metrics, and evaluated 8 different LLMs on it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, namely HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.
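A hedged sketch of an LLM-driven hybrid genetic loop for grammar search follows; `llm_propose` is a hypothetical call that mutates or crosses over candidate BNF grammars, and `parses` is a hypothetical helper that builds a parser from a candidate grammar and tests one string against it.

```python
# Hypothetical LLM-in-the-loop genetic search over BNF grammars.
import random

def fitness(grammar, positives, negatives, parses):
    """Fraction of examples handled: positives parse, negatives do not."""
    ok = sum(parses(grammar, s) for s in positives)
    ok += sum(not parses(grammar, s) for s in negatives)
    return ok / (len(positives) + len(negatives))

def evolve(seeds, positives, negatives, parses, llm_propose, steps=20):
    pop = list(seeds)                        # assumes >= 2 seed grammars
    score = lambda g: fitness(g, positives, negatives, parses)
    for _ in range(steps):
        parents = sorted(pop, key=score, reverse=True)[:max(2, len(pop) // 2)]
        children = [llm_propose(random.sample(parents, 2)) for _ in parents]
        pop = parents + children             # elitism plus LLM offspring
    return max(pop, key=score)
```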
https://arxiv.org/abs/2505.16978