Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at this https URL.
https://arxiv.org/abs/2505.17022
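The dual-stage, multi-dimensional reward is GoT-R1's core mechanism, so a toy sketch may help fix the idea: one stage scores the reasoning chain against the prompt, the other scores the rendered image for layout fidelity and visual quality, and the weighted sum becomes the RL reward. The scorer stubs, weights, and data layout below are illustrative assumptions, not the paper's implementation (which uses MLLM judges).

```python
# Minimal sketch of a dual-stage, multi-dimensional reward in the spirit of GoT-R1.
# The component scorers stand in for MLLM judgments; names and weights are assumptions.
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt: str          # user prompt
    reasoning: str       # generated chain-of-thought layout plan
    image: object        # generated image (placeholder)

def score_semantics(prompt: str, reasoning: str) -> float:
    """Stage 1: does the reasoning cover the objects/attributes in the prompt? (stub)"""
    return 1.0 if all(w in reasoning for w in prompt.split()[:3]) else 0.5

def score_layout(reasoning: str, image) -> float:
    """Stage 2: does the image follow the planned spatial layout? (stub)"""
    return 0.8

def score_quality(image) -> float:
    """Stage 2: overall visual quality as judged by an MLLM. (stub)"""
    return 0.9

def got_r1_reward(r: Rollout, w=(0.4, 0.4, 0.2)) -> float:
    """Weighted sum of process- and outcome-level scores used as the RL reward."""
    s = (score_semantics(r.prompt, r.reasoning),
         score_layout(r.reasoning, r.image),
         score_quality(r.image))
    return sum(wi * si for wi, si in zip(w, s))

print(got_r1_reward(Rollout("a red cube left of a blue ball", "red cube placed left ...", None)))
```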
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final answer. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1 as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed by comparing the thinking rewards of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at this https URL.
https://arxiv.org/abs/2505.17018
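As a rough illustration of the two ideas above, the snippet below derives a trustworthiness weight from the gap between the thinking rewards of correct and incorrect rollouts, then anneals the thinking-reward term over training. The gap-to-weight mapping and the linear annealing schedule are assumptions for illustration, not the paper's exact formulas.

```python
# Hedged sketch of the Trust-GRPO idea from SophiaVL-R1.
import numpy as np

def trust_weight(think_rewards, correct_mask, eps=1e-6):
    """Higher trust when correct answers also receive higher thinking rewards."""
    good = think_rewards[correct_mask]
    bad = think_rewards[~correct_mask]
    if len(good) == 0 or len(bad) == 0:
        return 1.0
    gap = good.mean() - bad.mean()          # positive gap -> thinking reward is informative
    return float(np.clip(0.5 + gap, 0.0, 1.0))

def combined_reward(outcome_r, think_r, trust, step, total_steps):
    """Anneal the thinking-reward contribution toward zero as training proceeds."""
    anneal = max(0.0, 1.0 - step / total_steps)
    return outcome_r + anneal * trust * think_r

think = np.array([0.9, 0.8, 0.3, 0.2])
correct = np.array([True, True, False, False])
t = trust_weight(think, correct)
print(combined_reward(outcome_r=1.0, think_r=0.9, trust=t, step=100, total_steps=1000))
```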
Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at this https URL
https://arxiv.org/abs/2505.17017
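Since the study contrasts GRPO and DPO under image reward models, a compact side-by-side of the two objectives may be useful. The reward values and log-probabilities below are dummy numbers, and the beta and normalization details follow the standard textbook forms rather than the paper's exact settings.

```python
# Illustrative contrast between GRPO's group-relative advantage and DPO's pairwise loss.
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within a group of rollouts sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on a (chosen, rejected) image pair."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -torch.nn.functional.logsigmoid(margin)

rewards = torch.tensor([0.9, 0.4, 0.7, 0.1])   # reward-model scores for 4 sampled images
print(grpo_advantages(rewards))
print(dpo_loss(torch.tensor(-5.0), torch.tensor(-7.0),
               torch.tensor(-5.5), torch.tensor(-6.5)))
```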
Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT cold-start phase for preliminary format learning, followed by RL for dynamic knowledge acquisition. The RL stage uses outcome supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at this https URL.
https://arxiv.org/abs/2505.17005
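The abstract names three RL-stage ingredients: outcome supervision, a reward for internal-knowledge use, and a memorization mechanism. The sketch below shows one plausible shaping of such a reward and a trivial memorization hook; the bonus value, the zero-search-call criterion, and the memory format are assumptions, not the paper's design.

```python
# Rough sketch of reward shaping in the spirit of R1-Searcher++.
def r1_searcher_reward(answer_correct: bool, num_search_calls: int, internal_bonus: float = 0.2):
    """Outcome reward plus a bonus for answering correctly without external retrieval."""
    reward = 1.0 if answer_correct else 0.0
    if answer_correct and num_search_calls == 0:
        reward += internal_bonus          # encourage using internal knowledge when sufficient
    return reward

def memorize(memory: list, retrieved_passages: list):
    """Memorization stub: fold retrieved evidence back into the model's training data."""
    memory.extend(p for p in retrieved_passages if p not in memory)
    return memory

memory = []
print(r1_searcher_reward(True, 0), r1_searcher_reward(True, 2), r1_searcher_reward(False, 3))
memorize(memory, ["Paris is the capital of France."])
print(memory)
```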
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules to yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs are limited by significant resource cost and suboptimal joint optimization. To address these issues, we propose \name, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of \name\ simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of \name, showing relative improvements of 68.67\% in Hit@5 and 45.21\% in NDCG@20. Code available at this https URL.
https://arxiv.org/abs/2505.16994
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45\% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model to perform comparably to GPT-4o on the \textit{hard} split, underscoring the value of its high-quality training data. Code is available here: \href{this https URL}{this https URL}.
https://arxiv.org/abs/2505.16975
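Because SWE-Dev's RL signal comes from developer-authored executable unit tests, a minimal harness sketch may clarify the loop: run the tests in the instance's environment and turn the pass fraction into a scalar reward. The pytest invocation and summary parsing below are assumptions about one possible harness, not SWE-Dev's official tooling, and the example path is hypothetical.

```python
# Sketch of turning executable unit tests into an RL reward signal.
import subprocess

def unit_test_reward(repo_dir: str, timeout_s: int = 300) -> float:
    """Run the instance's tests and return the fraction that pass (0.0 if none ran)."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "--tb=no", "-q"],
        cwd=repo_dir, capture_output=True, text=True, timeout=timeout_s,
    )
    summary = proc.stdout.strip().splitlines()[-1] if proc.stdout.strip() else ""
    passed = failed = value = 0
    for token in summary.replace(",", " ").split():
        if token.isdigit():
            value = int(token)                      # count preceding a status word, e.g. "3 passed"
        elif token.startswith("passed"):
            passed = value
        elif token.startswith(("failed", "error")):
            failed += value
    total = passed + failed
    return passed / total if total else 0.0

# Example (hypothetical path): reward = unit_test_reward("/tmp/swe_dev_instance_0042")
```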
Improving the performance of pre-trained policies through online reinforcement learning (RL) is a critical yet challenging topic. Existing online RL fine-tuning methods require continued training with offline pretrained Q-functions for stability and performance. However, these offline pretrained Q-functions commonly underestimate state-action pairs beyond the offline dataset due to the conservatism in most offline RL methods, which hinders further exploration when transitioning from the offline to the online setting. Additionally, this requirement limits their applicability in scenarios where only pre-trained policies are available but pre-trained Q-functions are absent, such as in imitation learning (IL) pre-training. To address these challenges, we propose a method for efficient online RL fine-tuning using solely the offline pre-trained policy, eliminating reliance on pre-trained Q-functions. We introduce PORL (Policy-Only Reinforcement Learning Fine-Tuning), which rapidly initializes the Q-function from scratch during the online phase to avoid detrimental pessimism. Our method not only achieves competitive performance with advanced offline-to-online RL algorithms and online RL approaches that leverage data or policies prior, but also pioneers a new path for directly fine-tuning behavior cloning (BC) policies.
https://arxiv.org/abs/2505.16856
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance, and in some cases even improving it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at this https URL.
https://arxiv.org/abs/2505.16854
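The 'thought dropout' operation is simple enough to show concretely: during SFT, the reasoning trace is replaced with an empty thought with some probability, so the model sees both think and no-think formats. The tag names and the 0.5 dropout probability below are illustrative assumptions.

```python
# Minimal sketch of thought dropout for building think-or-not SFT targets.
import random

EMPTY_THOUGHT = "<think>\n\n</think>"

def thought_dropout(reasoning: str, answer: str, p_drop: float = 0.5) -> str:
    """Build an SFT target that sometimes skips the reasoning trace entirely."""
    thought = EMPTY_THOUGHT if random.random() < p_drop else f"<think>{reasoning}</think>"
    return f"{thought}\n<answer>{answer}</answer>"

random.seed(0)
for _ in range(3):
    print(thought_dropout("The plate has 3 apples; 3 > 2.", "3"))
```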
The rapid spread of multimodal misinformation on social media has raised growing concerns, while research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. We further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long-Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.
https://arxiv.org/abs/2505.16836
Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms such as GRPO and its variants like DAPO suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE), a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimate. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.
https://arxiv.org/abs/2505.16826
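To make the token-level idea concrete, the sketch below scores each token by the gap in its relative frequency between correct and incorrect rollouts and uses that score to modulate the shared rollout-level advantage. The frequency-gap statistic and the mixing coefficient are illustrative stand-ins for the statistical analysis KTAE actually performs.

```python
# Hedged sketch of combining token-level importance with a rollout-level advantage.
from collections import Counter

def token_importance(correct_rollouts, incorrect_rollouts):
    """Score tokens by the gap in their relative frequency across outcome groups."""
    pos = Counter(t for r in correct_rollouts for t in r)
    neg = Counter(t for r in incorrect_rollouts for t in r)
    n_pos = sum(pos.values()) or 1
    n_neg = sum(neg.values()) or 1
    vocab = set(pos) | set(neg)
    return {t: pos[t] / n_pos - neg[t] / n_neg for t in vocab}

def key_token_advantage(rollout_tokens, rollout_advantage, importance, alpha=0.5):
    """Modulate the shared rollout-level advantage with per-token importance."""
    return [rollout_advantage * (1.0 + alpha * importance.get(t, 0.0)) for t in rollout_tokens]

correct = [["x", "=", "4"], ["so", "x", "=", "4"]]
incorrect = [["x", "=", "5"]]
imp = token_importance(correct, incorrect)
print(key_token_advantage(["x", "=", "4"], rollout_advantage=1.2, importance=imp))
```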
Serious Games (SGs) are nowadays shifting focus to include procedural content generation (PCG) in the development process as a means of offering a personalized and enhanced player experience. However, developing a framework to assess the impact of PCG techniques when integrated into SGs remains particularly challenging. This study proposes a methodology for the automated evaluation of PCG integration in SGs, incorporating deep reinforcement learning (DRL) game-testing agents. To validate the proposed framework, a previously introduced SG featuring card game mechanics and incorporating three different versions of PCG for non-player character (NPC) creation was deployed. Version 1 features random NPC creation, while Versions 2 and 3 utilize a genetic algorithm approach. These versions are used to test the impact of different dynamic SG environments on the proposed framework's agents. The obtained results highlight the superiority of the DRL game-testing agents trained on Versions 2 and 3 over those trained on Version 1 in terms of win rate (i.e., the number of wins per game played) and training time. More specifically, in a test emulating regular gameplay, both Versions 2 and 3 peaked at a 97% win rate and achieved statistically significantly higher win rates (p = 0.0009) than Version 1, which peaked at 94%. Overall, the results support the proposed framework's capability to produce meaningful data for the evaluation of procedurally generated content in SGs.
https://arxiv.org/abs/2505.16801
Model-based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. MBRL methods have progressed largely by prioritising the actor, while optimising the world model's learning has been comparatively neglected. Improving the fidelity of the world model and reducing its time to convergence can yield significant downstream benefits, one of which is improving the ensuing performance of any actor it may train. We propose a novel approach that anticipates and actively seeks out high-entropy states using short-horizon latent predictions generated by the world model, offering a principled alternative to traditional curiosity-driven methods that chase once-novel states well after they were stumbled into. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multi-step plans after every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, the planning horizon length, and the weighting between reward and entropy. While our method can theoretically be applied to any model that trains its own actors with solely model-generated data, we have applied it to just Dreamer as a proof of concept. Our method finishes the Miniworld procedurally generated mazes 50% faster than base Dreamer at convergence, and the policy trained in imagination converges in only 60% of the environment steps that base Dreamer needs.
https://arxiv.org/abs/2505.16787
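As a toy rendering of the reward-entropy trade-off that the planner manages, the snippet below scores candidate actions by predicted reward plus the entropy of the world model's short-horizon latent prediction. The diagonal-Gaussian entropy form, the beta weight, and the numbers are assumptions, not the Dreamer-based implementation.

```python
# Sketch of entropy-seeking action scoring from short-horizon latent predictions.
import numpy as np

def gaussian_entropy(std: np.ndarray) -> float:
    """Differential entropy of a diagonal Gaussian latent prediction."""
    return float(0.5 * np.sum(np.log(2 * np.pi * np.e * std ** 2)))

def score_action(predicted_reward: float, predicted_std: np.ndarray, beta: float) -> float:
    """Trade off predicted reward against predicted-state entropy."""
    return predicted_reward + beta * gaussian_entropy(predicted_std)

# Two imagined short-horizon outcomes from the world model (illustrative numbers):
# the second, more uncertain outcome wins despite its lower predicted reward.
print(score_action(1.0, np.array([0.1, 0.1]), beta=0.05))
print(score_action(0.8, np.array([0.9, 1.2]), beta=0.05))
```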
Text-to-image models are powerful at producing high-quality images from given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework designed to rephrase a simple user prompt into a sophisticated prompt for a text-to-image model. Specifically, we employ large vision-language models (LVLMs) as the solver to rewrite the user prompt, and concurrently employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by producing a solution and judging it itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.
https://arxiv.org/abs/2505.16763
Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present \textbf{Mesh-RFT}, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6\% and improves Topology Score (TS) by 3.8\% over pre-trained models, while outperforming global DPO methods with a 17.4\% HD reduction and 4.9\% TS gain. These results demonstrate Mesh-RFT's ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation. Project Page: \href{this https URL}{this https URL}.
https://arxiv.org/abs/2505.16761
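Of the two metrics named above, the Boundary Edge Ratio is easy to illustrate: count the fraction of edges that belong to only one face. The definition below is a plausible reading for illustration and may differ from the paper's exact formulation.

```python
# Hedged sketch of a Boundary Edge Ratio (BER) computation on a triangle mesh.
from collections import Counter

def boundary_edge_ratio(faces):
    """faces: list of vertex-index triples. An edge used by a single face is a boundary edge."""
    edges = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges[tuple(sorted((u, v)))] += 1
    boundary = sum(1 for count in edges.values() if count == 1)
    return boundary / len(edges)

# Two triangles sharing one edge: 5 edges in total, 4 of them on the boundary.
print(boundary_edge_ratio([(0, 1, 2), (1, 3, 2)]))
```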
In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse-reward and advantage-vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO also shares reward information during advantage computation, estimating solution advantages hierarchically across and within question variants, which allows more accurate estimation of relative advantages and improves the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at this https URL.
https://arxiv.org/abs/2505.16673
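A small sketch of the hierarchical advantage idea: normalize rewards within each transformed variant of a question, normalize again across all variants of the same base question, and blend the two. The equal-weight blend and the normalization order are illustrative assumptions rather than Share-GRPO's exact estimator.

```python
# Sketch of hierarchical advantage estimation across and within question variants.
import numpy as np

def hierarchical_advantage(rewards_per_variant, lam=0.5):
    """rewards_per_variant: list of 1-D arrays, one per transformed question variant."""
    all_rewards = np.concatenate(rewards_per_variant)
    global_adv = (all_rewards - all_rewards.mean()) / (all_rewards.std() + 1e-6)
    advantages, offset = [], 0
    for rewards in rewards_per_variant:
        local_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        g = global_adv[offset: offset + len(rewards)]
        advantages.append(lam * local_adv + (1 - lam) * g)
        offset += len(rewards)
    return advantages

variants = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0])]
print(hierarchical_advantage(variants))
```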
Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judged rewards. Trained with SSR on 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct, on English $\leftrightarrow$ Chinese translation tasks from the WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English $\leftrightarrow$ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data, and models.
https://arxiv.org/abs/2505.16637
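The self-rewarding loop can be sketched in a few lines: the same model generates candidate translations and then scores them, and the self-assigned score is used as the RL reward. The prompt wording, the 0-10 scale, and the StubModel below are illustrative assumptions standing in for the actual backbone.

```python
# Hedged sketch of a self-rewarding loop for reference-free MT training.
class StubModel:
    """Stand-in for an LLM exposing a text-in/text-out generate() method."""
    def generate(self, prompt: str) -> str:
        return "7" if prompt.startswith("Rate") else "一个示例翻译。"

def ssr_rewards(model, source_sentence: str, num_samples: int = 4):
    """Sample candidate translations and score each with the model's own judgment."""
    candidates = [model.generate(f"Translate to Chinese: {source_sentence}")
                  for _ in range(num_samples)]
    rewards = []
    for cand in candidates:
        judgement = model.generate(
            f"Rate this translation of '{source_sentence}' from 0 to 10: {cand}")
        rewards.append(float(judgement.strip().split()[0]) / 10.0)
    return candidates, rewards

print(ssr_rewards(StubModel(), "The weather is nice today."))
```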
Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain, up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-ended problems. Open-ended questions, which are characterized by the lack of a standard answer or by non-unique, diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly sized models, while performing on par with much larger ones.
https://arxiv.org/abs/2505.16582
In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, you should 1) train an ensemble of distilled policies, and 2) distil it on as much data from the training environments as possible. We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
https://arxiv.org/abs/2505.16581
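The two practical recommendations, distil an ensemble and distil on as much training-environment data as possible, translate directly into a short recipe: fit several students to the teacher's action distributions over a large state buffer and average their outputs at test time. Network sizes, the KL objective, and the random data below are illustrative assumptions.

```python
# Sketch of distilling an ensemble of student policies and averaging them at test time.
import torch
import torch.nn as nn

def distil_student(teacher_probs: torch.Tensor, states: torch.Tensor, epochs=50):
    """Fit one student to the teacher's action distribution with a KL objective."""
    student = nn.Sequential(nn.Linear(states.shape[1], 32), nn.ReLU(),
                            nn.Linear(32, teacher_probs.shape[1]))
    opt = torch.optim.Adam(student.parameters(), lr=1e-2)
    for _ in range(epochs):
        log_q = torch.log_softmax(student(states), dim=-1)
        loss = torch.nn.functional.kl_div(log_q, teacher_probs, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

def ensemble_policy(students, state: torch.Tensor) -> torch.Tensor:
    """Average the students' action distributions at test time."""
    probs = torch.stack([torch.softmax(s(state), dim=-1) for s in students])
    return probs.mean(dim=0)

states = torch.randn(256, 4)                            # states gathered from training envs
teacher_probs = torch.softmax(torch.randn(256, 3), -1)  # teacher policy outputs on those states
students = [distil_student(teacher_probs, states) for _ in range(3)]
print(ensemble_policy(students, states[:1]))
```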
Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to the explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
https://arxiv.org/abs/2505.16552
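The compression step is concrete enough for a short sketch: merge the embeddings of every `factor` consecutive reasoning tokens into one latent, with the factor sampled per example from a predefined range. Mean-pooling as the merge operation and the sampling range are assumptions, since the paper trains a dedicated latent head on top of such compressed embeddings.

```python
# Sketch of CoLaR-style compression of consecutive token embeddings.
import torch

def compress_embeddings(token_emb: torch.Tensor, factor: int) -> torch.Tensor:
    """token_emb: (seq_len, dim). Merge every `factor` consecutive embeddings."""
    seq_len, dim = token_emb.shape
    usable = (seq_len // factor) * factor
    chunks = token_emb[:usable].reshape(-1, factor, dim)
    return chunks.mean(dim=1)                      # (seq_len // factor, dim)

emb = torch.randn(12, 8)                           # a 12-token reasoning trace
factor = int(torch.randint(2, 5, (1,)))            # compression factor sampled per example
compressed = compress_embeddings(emb, factor)
print(factor, compressed.shape)                    # e.g. 3, torch.Size([4, 8])
```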
This paper presents an end-to-end deep reinforcement learning (RL) framework for occlusion-aware robotic manipulation in cluttered plant environments. Our approach enables a robot to interact with a deformable plant to reveal hidden objects of interest, such as fruits, using multimodal observations. We decouple the kinematic planning problem from robot control to simplify zero-shot sim2real transfer for the trained policy. Our results demonstrate that the trained policy, deployed using our framework, achieves up to 86.7% success in real-world trials across diverse initial conditions. Our findings pave the way toward autonomous, perception-driven agricultural robots that intelligently interact with complex foliage plants to "find the fruit" in challenging occluded scenarios, without the need for explicitly designed geometric and dynamic models of every plant scenario.
https://arxiv.org/abs/2505.16547