Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at this https URL
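To ground the comparison, the snippet below sketches the two update signals in their generic forms: GRPO's group-relative advantage computed from several images sampled for the same prompt, and DPO's pairwise loss over a chosen/rejected pair scored under the policy and a frozen reference. This is a minimal illustration under assumed hyperparameters (`beta`, `eps`), not the implementation studied in the paper.

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each sample's reward (e.g. from a text-image
    consistency or aesthetic reward model) is normalized against the other
    samples drawn for the same prompt. Generic GRPO sketch, not the paper's code."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO loss for one preference pair, given sequence log-probabilities
    summed over the autoregressive image tokens under the policy and a frozen reference."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Example: four images sampled for one prompt, scored by a reward model.
print(grpo_advantages([0.8, 0.5, 0.9, 0.2]))
print(dpo_loss(-120.0, -118.0, -121.0, -117.5))
```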
https://arxiv.org/abs/2505.17017
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, improving the lightweight QueST model by 21.2% and raising the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA takes an otherwise unworkable SFT model (4% success rate) to a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models with minimal supervision.
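The leave-one-out advantage estimate mentioned above has a simple closed form when K rollouts of the same context receive sparse binary success rewards; a minimal sketch under that reading (not the released RIPT-VLA code) follows. Contexts where all rollouts succeed or all fail yield zero advantage, which is the kind of case dynamic rollout sampling is meant to handle.

```python
def leave_one_out_advantages(rewards):
    """Leave-one-out baseline: each rollout's binary success reward is compared
    against the mean reward of the *other* rollouts for the same task/context."""
    k = len(rewards)
    assert k > 1, "need at least two rollouts per context"
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Example: 5 rollouts of one task with sparse binary success rewards.
print(leave_one_out_advantages([1, 0, 0, 1, 0]))  # successes get positive advantage
```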
https://arxiv.org/abs/2505.17016
Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
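One compact way to state the Bayesian view discussed above is as a mixture over pretraining tasks whose posterior is reweighted by the prompt; the display below is a standard rendering of that idea, not an equation quoted from the paper.

```latex
% Meta-trained predictor as a Bayesian mixture over tasks \tau,
% with the prompt s acting purely through the posterior over tasks:
\[
  p(x_{t+1} \mid s, x_{1:t})
  \;=\; \sum_{\tau} p(\tau \mid s, x_{1:t})\, p_\tau(x_{t+1} \mid x_{1:t}),
  \qquad
  p(\tau \mid s, x_{1:t}) \;\propto\; p(\tau)\, p_\tau(s, x_{1:t}).
\]
% Optimal prompting for a target task \tau^* amounts to choosing s so that the
% posterior concentrates on \tau^*; if \tau^* has negligible mass under the
% pretraining distribution, no hard-token prompt can achieve this, which is the
% kind of fundamental limitation that only weight tuning can overcome.
```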
https://arxiv.org/abs/2505.17010
Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each was trained via SFT and subsequently enhanced by DPO to align with the psychological preference. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.
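For reference, the preference-alignment step uses the standard DPO objective over the IPM-mined pairs, presumably applied separately to the strategy-planning and response-generation subtasks; the display below is the generic DPO loss rather than anything specific to IPM-PrefDial.

```latex
% Standard DPO objective over preference pairs (y_w preferred to y_l given context x):
\[
  \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right].
\]
```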
https://arxiv.org/abs/2505.16995
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules to yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs suffer from significant resource costs and suboptimal joint optimization. To address these issues, we propose \name, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of \name\ simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of \name, showing relative improvements of 68.67\% in Hit@5 and 45.21\% in NDCG@20. Code available at this https URL.
https://arxiv.org/abs/2505.16994
We propose UniPhy, a common latent-conditioned neural constitutive model that can encode the physical properties of diverse materials. At inference, UniPhy allows `inverse simulation', i.e., inferring material properties by optimizing the scene-specific latent to match the available observations via differentiable simulation. In contrast to existing methods that treat such inference as system identification, UniPhy does not rely on user-specified material type information. Compared to prior neural constitutive modeling approaches which learn instance-specific networks, the shared training across materials improves both the robustness and the accuracy of the estimates. We train UniPhy using simulated trajectories across diverse geometries and materials -- elastic, plasticine, sand, and fluids (Newtonian & non-Newtonian). At inference, given an object with unknown material properties, UniPhy can infer the material properties via latent optimization to match the motion observations, and can then allow re-simulating the object under diverse scenarios. We compare UniPhy against prior inverse simulation methods, and show that the inference from UniPhy enables more accurate replay and re-simulation under novel conditions.
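The inverse-simulation step amounts to gradient-based optimization of the scene latent through a differentiable simulator. The sketch below assumes a hypothetical `model.simulate` rollout and `model.latent_dim` attribute together with PyTorch autograd; it illustrates the loop only and is not the UniPhy codebase.

```python
import torch

def infer_material_latent(model, init_state, observed_traj, steps=500, lr=1e-2):
    """Fit a scene-specific latent z so that differentiable simulation with the
    shared constitutive model reproduces the observed motion. `model.simulate`
    and `model.latent_dim` are hypothetical placeholders for a differentiable
    rollout interface, used here purely for illustration."""
    z = torch.zeros(model.latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred_traj = model.simulate(init_state, latent=z)    # differentiable rollout
        loss = torch.mean((pred_traj - observed_traj) ** 2)  # match the observations
        loss.backward()
        opt.step()
    return z.detach()  # the inferred latent can then be reused to re-simulate the object
```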
https://arxiv.org/abs/2505.16971
Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making it stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
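For context on the polynomial family Polar Express builds on, the classical Newton-Schulz iteration below computes the polar factor with matrix-matrix products only; Polar Express keeps this matmul-only structure but replaces the fixed cubic coefficients with per-iteration minimax-optimal ones (not reproduced here). A baseline sketch, not the paper's algorithm:

```python
import torch

def newton_schulz_polar(G, num_iters=10, eps=1e-7):
    """Classical Newton-Schulz iteration X_{k+1} = 1.5*X_k - 0.5*X_k X_k^T X_k,
    converging to the polar factor U of G = U P. Uses only matmuls, hence
    GPU-friendly, but exhibits the slow initial convergence the paper improves on."""
    X = G / (G.norm() + eps)          # scale so all singular values are <= 1
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = torch.randn(64, 32)
U = newton_schulz_polar(G)
print(torch.dist(U.T @ U, torch.eye(32)))  # approaches 0 as iterations grow
```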
https://arxiv.org/abs/2505.16932
Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages. MPO directly minimizes the reward gap difference between the dominant language and target languages, effectively transferring safety capabilities while preserving the original strengths of the dominant language. Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate MPO's efficacy in multilingual safety alignment without degrading general multilingual utility.
https://arxiv.org/abs/2505.16869
This paper addresses the challenge of graph domain adaptation on evolving, multiple out-of-distribution (OOD) graphs. Conventional graph domain adaptation methods are confined to single-step adaptation, making them ineffective in handling continuous domain shifts and prone to catastrophic forgetting. This paper introduces the Graph Continual Adaptive Learning (GCAL) method, designed to enhance model sustainability and adaptability across various graph domains. GCAL employs a bilevel optimization strategy. The "adapt" phase uses an information maximization approach to fine-tune the model with new graph domains while re-adapting past memories to mitigate forgetting. Concurrently, the "generate memory" phase, guided by a theoretical lower bound derived from information bottleneck theory, involves a variational memory graph generation module to condense original graphs into memories. Extensive experimental evaluations demonstrate that GCAL substantially outperforms existing methods in terms of adaptability and knowledge retention.
https://arxiv.org/abs/2505.16860
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at this https URL.
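The 'thought dropout' operation in stage (i) can be pictured as a one-line data transformation: with some probability, the reasoning trace of an SFT example is swapped for an empty thought so the model also learns a valid answer-without-thinking format. Tag strings and the dropout rate below are illustrative assumptions, not the paper's exact setup.

```python
import random

def thought_dropout(example, p_drop=0.5, empty_thought="<think>\n</think>"):
    """With probability p_drop, replace the reasoning trace with an empty thought,
    keeping the final answer unchanged. This yields a think-or-not SFT format that
    later serves as a cold start for selective reasoning under GRPO."""
    if random.random() < p_drop:
        example = dict(example, thought=empty_thought)
    return example

sample = {"question": "How many apples are in the image?",
          "thought": "<think>I count three on the table and one in the bowl.</think>",
          "answer": "4"}
print(thought_dropout(sample, p_drop=1.0))  # forces the empty-thought variant
```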
https://arxiv.org/abs/2505.16854
The rapid spread of multimodal misinformation on social media has raised growing concerns, while research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. In addition, we further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long-Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.
https://arxiv.org/abs/2505.16836
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss once a simple trigger, such as a fixed iteration, is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating that it is a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL.
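The two-phase schedule boils down to a small change in the training objective: the alignment term is added during Phase I and switched off for good once the trigger fires. A minimal sketch, with the trigger modeled as a fixed iteration and an assumed loss weight:

```python
def haste_loss(denoise_loss, align_loss, step, stop_step=50_000, align_weight=0.5):
    """Phase I adds the holistic alignment term (attention-map and feature-projection
    distillation from the teacher) to the diffusion objective; Phase II drops it
    permanently once a simple trigger such as a fixed iteration is hit. The weight
    and trigger value here are illustrative assumptions, not HASTE's actual settings."""
    if step < stop_step:
        return denoise_loss + align_weight * align_loss
    return denoise_loss

print(haste_loss(0.80, 0.30, step=10_000))  # Phase I: 0.95 (alignment active)
print(haste_loss(0.80, 0.30, step=60_000))  # Phase II: 0.80 (alignment terminated)
```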
https://arxiv.org/abs/2505.16792
Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework, designed to rephrase a simple user prompt into a sophisticated prompt for a text-to-image model. Specifically, we employ large vision-language models (LVLMs) as the solver to rewrite the user prompt, and concurrently employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by giving a solution and judging itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.
https://arxiv.org/abs/2505.16763
Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present \textbf{Mesh-RFT}, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6\% and improves Topology Score (TS) by 3.8\% over pre-trained models, while outperforming global DPO methods with a 17.4\% HD reduction and 4.9\% TS gain. These results demonstrate Mesh-RFT's ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation. Project Page: \href{this https URL}{this https URL}.
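Of the two topology metrics, the Boundary Edge Ratio admits a straightforward reading as the fraction of mesh edges incident to exactly one face (open boundary edges). The sketch below computes that quantity for a triangle mesh under this assumed definition; the paper's exact formulation may differ.

```python
from collections import Counter

def boundary_edge_ratio(faces):
    """Fraction of edges used by exactly one face (boundary/open edges) in a
    triangle mesh given as vertex-index triples. Lower values indicate a more
    watertight mesh. This follows an assumed reading of BER, not the paper's code."""
    edge_counts = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_counts[tuple(sorted((u, v)))] += 1
    boundary = sum(1 for n in edge_counts.values() if n == 1)
    return boundary / max(len(edge_counts), 1)

# A single triangle: all three edges are boundary edges -> BER = 1.0
print(boundary_edge_ratio([(0, 1, 2)]))
```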
https://arxiv.org/abs/2505.16761
Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures for exploring cross-modal correlations. However, the strong discriminative nature of the text modality may dominate the optimization process and inhibit image representation learning. This non-negligible cross-modal optimization imbalance remains a bottleneck to improving model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). VEA mines fine-grained image features by a Differential Attention (DA) mechanism, while TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves cross-modal alignment robustness through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on the RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the fully fine-tuned GeoRSCLIP model.
https://arxiv.org/abs/2505.16756
The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has sparked substantial safety concerns. Despite the implementation of safety alignment techniques during the pre-training phase, recent research indicates that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental issue of why fine-tuning on non-harmful data still results in safety degradation. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, mitigating the model's risk of safety degradation by identifying potential pitfalls in gradient directions, thereby enhancing task-specific performance while successfully preserving model safety. Our extensive experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model and achieves comparable test loss to standard fine-tuning methods. Our code is available at this https URL.
https://arxiv.org/abs/2505.16737
Optimal decision-making under partial observability requires agents to balance reducing uncertainty (exploration) against pursuing immediate objectives (exploitation). In this paper, we introduce a novel policy optimization framework for continuous partially observable Markov decision processes (POMDPs) that explicitly addresses this challenge. Our method casts policy learning as probabilistic inference in a non-Markovian Feynman--Kac model that inherently captures the value of information gathering by anticipating future observations, without requiring extrinsic exploration bonuses or handcrafted heuristics. To optimize policies under this model, we develop a nested sequential Monte Carlo~(SMC) algorithm that efficiently estimates a history-dependent policy gradient under samples from the optimal trajectory distribution induced by the POMDP. We demonstrate the effectiveness of our algorithm across standard continuous POMDP benchmarks, where existing methods struggle to act under uncertainty.
https://arxiv.org/abs/2505.16732
This paper presents a new approach for 6DoF Direct LiDAR-Inertial Odometry (D-LIO) based on the simultaneous mapping of truncated distance fields on CPU. Such a continuous representation (in the vicinity of the points) enables working with raw 3D LiDAR data online, avoiding the need for LiDAR feature selection and tracking, simplifying the odometry pipeline, and generalizing easily to many scenarios. The method builds on the proposed Fast Truncated Distance Field (Fast-TDF) method as a convenient tool to represent the environment. Such a representation enables i) solving the LiDAR point-cloud registration as a nonlinear optimization process without the need to select or track LiDAR features in the input data, ii) simultaneously producing an accurate truncated distance field map of the environment, and iii) updating such a map in constant time independently of its size. The approach is tested on open datasets, both aerial and ground. It is also benchmarked against other state-of-the-art odometry approaches, demonstrating the same or better level of accuracy, with the added value of an online-generated TDF representation of the environment that can be used for other robotics tasks such as planning or collision avoidance. The source code is publicly available at this https URL
https://arxiv.org/abs/2505.16726
While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose \textit{Sequential Chunk-wise Optimization} (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce \textit{Sparse Chunk-wise Optimization} (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands maximum sequence length from 1K to 16K tokens, while SpaCO demonstrates accelerated training speed -- achieving up to 3x faster than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at \href{this https URL}{here}.
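The memory saving comes from building and freeing one chunk's computational graph at a time while a detached KV cache carries context forward, with gradients accumulating across chunks before a single optimizer step. The sketch below assumes a Hugging Face-style causal LM with the legacy tuple KV-cache format; detaching the cache truncates cross-chunk gradient paths, so treat it as an illustration of the chunk-wise pattern rather than a faithful SeCO or SpaCO implementation.

```python
import torch

def chunkwise_step(model, input_ids, labels, chunk_size, optimizer):
    """Process a long sequence in chunks: each chunk builds its own graph, its loss
    is backpropagated immediately (freeing that chunk's activations), and only a
    detached KV cache is carried forward. Per-chunk label shifting at boundaries is
    ignored here for brevity."""
    optimizer.zero_grad()
    past_kv = None
    total_loss = 0.0
    for start in range(0, input_ids.size(1), chunk_size):
        ids = input_ids[:, start:start + chunk_size]
        tgt = labels[:, start:start + chunk_size]
        out = model(input_ids=ids, past_key_values=past_kv, use_cache=True, labels=tgt)
        out.loss.backward()  # localized backpropagation for this chunk only
        past_kv = tuple(tuple(t.detach() for t in layer) for layer in out.past_key_values)
        total_loss += out.loss.item()
    optimizer.step()  # gradients accumulated across all chunks
    return total_loss
```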
https://arxiv.org/abs/2505.16710
While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces undesirable latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 40% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Code will be released later.
https://arxiv.org/abs/2505.16687