As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view reasoning tasks, with an average improvement of 18.4%.
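As a rough illustration of the elementary drawing operations described above, here is a minimal Python sketch using Pillow; the function names and the tool-call loop are hypothetical stand-ins, not VILASR's actual interface.

```python
from PIL import Image, ImageDraw

def annotate_box(image, box, label=None, color="red"):
    """Overlay a bounding box (and optional label) on a copy of the image."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline=color, width=3)
    if label:
        draw.text((box[0] + 4, box[1] + 4), label, fill=color)
    return img

def draw_auxiliary_line(image, start, end, color="blue"):
    """Overlay an auxiliary line between two points on a copy of the image."""
    img = image.copy()
    ImageDraw.Draw(img).line([start, end], fill=color, width=3)
    return img

# Hypothetical use inside a reasoning loop: the model emits a drawing action,
# the edited image is appended to its visual context, and reasoning continues
# over the updated image.
canvas = Image.new("RGB", (640, 480), "white")
canvas = annotate_box(canvas, (100, 120, 220, 260), label="object A")
canvas = draw_auxiliary_line(canvas, (160, 190), (500, 400))
```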
https://arxiv.org/abs/2506.09965
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at this https URL.
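A minimal sketch of the hybrid verification idea is shown below: hard constraints are checked by rules, soft constraints by an LLM judge. The constraint schema and the `judge` callable (a stand-in for a large reasoning model such as QwQ-32B) are illustrative assumptions, not VerIF's actual implementation.

```python
import re

def rule_verify(response, constraints):
    """Check hard, programmatically verifiable constraints (length, keywords, format)."""
    checks = []
    if "max_words" in constraints:
        checks.append(len(response.split()) <= constraints["max_words"])
    if "must_include" in constraints:
        checks.append(all(kw in response for kw in constraints["must_include"]))
    if "regex" in constraints:
        checks.append(re.search(constraints["regex"], response) is not None)
    return all(checks) if checks else True

def llm_verify(response, soft_constraints, judge):
    """Ask a reasoning-model judge about constraints that rules cannot capture."""
    prompt = (
        "Does the response satisfy every constraint below? Answer YES or NO.\n"
        f"Constraints: {soft_constraints}\nResponse: {response}"
    )
    return "YES" in judge(prompt).upper()

def verif_reward(response, constraints, judge):
    """Binary verification signal: both the rule check and the LLM check must pass."""
    return float(rule_verify(response, constraints) and
                 llm_verify(response, constraints.get("soft", []), judge))
```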
https://arxiv.org/abs/2506.09942
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments; it requires transferring knowledge from environments where empirical data is more readily available. Against this backdrop, this paper explores a fundamental question in online learning: can we employ non-i.i.d. actions to learn about confounders even when knowledge transfer is required? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an $\epsilon$-optimal policy with a tight sample complexity of $O(1/\epsilon^2)$.
https://arxiv.org/abs/2506.09940
In this paper, we propose a novel hierarchical framework for robot navigation in dynamic environments with heterogeneous constraints. Our approach leverages a graph neural network trained via reinforcement learning (RL) to efficiently estimate the robot's cost-to-go, formulated as local goal recommendations. A spatio-temporal path-searching module, which accounts for kinematic constraints, is then employed to generate a reference trajectory to facilitate solving the non-convex optimization problem used for explicit constraint enforcement. More importantly, we introduce an incremental action-masking mechanism and a privileged learning strategy, enabling end-to-end training of the proposed planner. Both simulation and real-world experiments demonstrate that the proposed method effectively addresses local planning in complex dynamic environments, achieving state-of-the-art (SOTA) performance. Compared with existing learning-optimization hybrid methods, our approach eliminates the dependency on high-fidelity simulation environments, offering significant advantages in computational efficiency and training scalability. The code will be released as open-source upon acceptance of the paper.
https://arxiv.org/abs/2506.09859
Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: a Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, so combining the two directly is inefficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters with supervised fine-tuning, rejection fine-tuning, and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at this https URL.
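A minimal sketch of what hint insertion might look like, assuming a partial chain-of-thought string and a crude textual trigger for spots that need exact computation; the hint text and trigger are illustrative, not the paper's Hint-Engineering pipeline.

```python
# Hypothetical hint nudging the model toward the code interpreter at the
# point where it would otherwise do arithmetic in natural language.
CODE_HINT = ("\n<hint>This step needs exact arithmetic; write and run Python "
             "code instead of computing in natural language.</hint>\n")

def insert_hint(cot: str, trigger: str = "let me compute") -> str:
    """Insert the hint just before the first occurrence of the trigger phrase."""
    idx = cot.lower().find(trigger)
    if idx == -1:
        return cot                      # no suitable insertion point found
    return cot[:idx] + CODE_HINT + cot[idx:]
```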
https://arxiv.org/abs/2506.09820
End-to-end autonomous driving has emerged as a promising paradigm for directly mapping sensor inputs to planning maneuvers using learning-based modular integrations. However, existing imitation learning (IL)-based models generalize poorly to hard cases and lack a corrective feedback loop after deployment. While reinforcement learning (RL) offers a potential solution for tackling hard cases optimally, it is often hindered by overfitting to specific driving cases, resulting in catastrophic forgetting of generalizable knowledge and sample inefficiency. To overcome these challenges, we propose Reinforced Refinement with Self-aware Expansion (R2SE), a novel learning pipeline that continually refines the hard-case domain while retaining a generalizable driving policy for model-agnostic end-to-end driving systems. Through reinforcement fine-tuning and policy expansion that facilitate continuous improvement, R2SE features three key components: 1) Generalist Pretraining with hard-case allocation trains a generalist imitation learning (IL) driving system while dynamically identifying failure-prone cases for targeted refinement; 2) Residual Reinforced Specialist Fine-tuning optimizes residual corrections using reinforcement learning (RL) to improve performance in the hard-case domain while preserving global driving knowledge; 3) Self-aware Adapter Expansion dynamically integrates specialist policies back into the generalist model, enabling continuous performance improvement. Experimental results in closed-loop simulation and on real-world datasets demonstrate improvements in generalization, safety, and long-horizon policy robustness over state-of-the-art E2E systems, highlighting the effectiveness of reinforced refinement for scalable autonomous driving.
https://arxiv.org/abs/2506.09800
AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward that ensures format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with a high pass rate and high node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and the Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
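The reward description above suggests a sketch like the following, assuming workflows are emitted as JSON node graphs; the field names and weights are illustrative, not the paper's exact rule-metric hybrid reward.

```python
import json

def workflow_reward(generated: str, reference_nodes: set) -> float:
    """Score a generated workflow on format validity, structure, and node fidelity."""
    try:
        graph = json.loads(generated)          # format validity: must parse
    except json.JSONDecodeError:
        return 0.0
    nodes = {n.get("type") for n in graph.get("nodes", [])}
    ids = {n.get("id") for n in graph.get("nodes", [])}
    edges = graph.get("edges", [])
    # Structural integrity: every edge must reference declared node ids.
    structural_ok = all(e.get("src") in ids and e.get("dst") in ids for e in edges)
    # Node-level fidelity: F1 between predicted and reference node types.
    tp = len(nodes & reference_nodes)
    precision = tp / max(len(nodes), 1)
    recall = tp / max(len(reference_nodes), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return 0.2 + 0.3 * float(structural_ok) + 0.5 * f1   # format bonus + weighted terms
```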
https://arxiv.org/abs/2506.09790
Obtaining multiple meaningfully diverse, high quality samples from Large Language Models for a fixed prompt remains an open challenge. Current methods for increasing diversity often only operate at the token-level, paraphrasing the same response. This is problematic because it leads to poor exploration on reasoning problems and to unengaging, repetitive conversational agents. To address this we propose Intent Factored Generation (IFG), factorising the sampling process into two stages. First, we sample a semantically dense intent, e.g., a summary or keywords. Second, we sample the final response conditioning on both the original prompt and the intent from the first stage. This allows us to use a higher temperature during the intent step to promote conceptual diversity, and a lower temperature during the final generation to ensure the outputs are coherent and self-consistent. Additionally, we find that prompting the model to explicitly state its intent for each step of the chain-of-thought before generating the step is beneficial for reasoning tasks. We demonstrate our method's effectiveness across a diverse set of tasks. We show this method improves both pass@k and Reinforcement Learning from Verifier Feedback on maths and code tasks. For instruction-tuning, we combine IFG with Direct Preference Optimisation to increase conversational diversity without sacrificing reward. Finally, we achieve higher diversity while maintaining the quality of generations on a general language modelling task, using a new dataset of reader comments and news articles that we collect and open-source. In summary, we present a simple method of increasing the sample diversity of LLMs while maintaining performance. This method can be implemented by changing the prompt and varying the temperature during generation, making it easy to integrate into many algorithms for gains across various applications.
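A minimal sketch of the two-stage sampling procedure, assuming a generic `generate(prompt, temperature)` callable around any LLM API; the prompt templates and default temperatures are illustrative, not the paper's exact settings.

```python
def intent_factored_sample(prompt, generate, intent_temp=1.2, response_temp=0.7):
    # Stage 1: a short, semantically dense intent sampled at high temperature
    # to encourage conceptual diversity across repeated draws.
    intent = generate(
        prompt + "\nFirst, state the key idea or keywords of your answer in one line.",
        temperature=intent_temp,
    )
    # Stage 2: the full response, conditioned on prompt and intent, sampled at
    # a lower temperature so the output stays coherent and self-consistent.
    response = generate(
        prompt + "\nIntended approach: " + intent + "\nNow write the full answer.",
        temperature=response_temp,
    )
    return intent, response

# Drawing k samples for one prompt; diversity comes mostly from the intents.
# samples = [intent_factored_sample(prompt, generate) for _ in range(k)]
```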
https://arxiv.org/abs/2506.09659
Dynamic locomotion of legged robots is a critical yet challenging topic in expanding the operational range of mobile robots. It requires precise planning when possible footholds are sparse, robustness against uncertainties and disturbances, and generalizability across diverse terrains. While traditional model-based controllers excel at planning on complex terrains, they struggle with real-world uncertainties. Learning-based controllers offer robustness to such uncertainties but often lack precision on terrains with sparse steppable areas. Hybrid methods achieve enhanced robustness on sparse terrains by combining both methods but are computationally demanding and constrained by the inherent limitations of model-based planners. To achieve generalized legged locomotion on diverse terrains while preserving the robustness of learning-based controllers, this paper proposes to learn an attention-based map encoding conditioned on robot proprioception, which is trained as part of the end-to-end controller using reinforcement learning. We show that the network learns to focus on steppable areas for future footholds when the robot dynamically navigates diverse and challenging terrains. We synthesize behaviors that exhibit robustness against uncertainties while enabling precise and agile traversal of sparse terrains. Additionally, our method offers a way to interpret the topographical perception of a neural network. We have trained two controllers for a 12-DoF quadrupedal robot and a 23-DoF humanoid robot respectively and tested the resulting controllers in the real world under various challenging indoor and outdoor scenarios, including ones unseen during training.
https://arxiv.org/abs/2506.09588
We study reinforcement learning from human feedback in general Markov decision processes, where agents learn from trajectory-level preference comparisons. A central challenge in this setting is to design algorithms that select informative preference queries to identify the underlying reward while ensuring theoretical guarantees. We propose a meta-algorithm based on randomized exploration, which avoids the computational challenges associated with optimistic approaches and remains tractable. We establish both regret and last-iterate guarantees under mild reinforcement learning oracle assumptions. To improve query complexity, we introduce and analyze an improved algorithm that collects batches of trajectory pairs and applies optimal experimental design to select informative comparison queries. The batch structure also enables parallelization of preference queries, which is relevant in practical deployment as feedback can be gathered concurrently. Empirical evaluation confirms that the proposed method is competitive with reward-based reinforcement learning while requiring a small number of preference queries.
https://arxiv.org/abs/2506.09508
We introduce Option Kernel Bellman Equations (OKBEs) for a new reward-free Markov Decision Process. Rather than a value function, OKBEs directly construct and optimize a predictive map called a state-time option kernel (STOK) to maximize the probability of completing a goal while avoiding constraint violations. STOKs are compositional, modular, and interpretable initiation-to-termination transition kernels for policies in the Options Framework of Reinforcement Learning. This means: 1) STOKs can be composed using Chapman-Kolmogorov equations to make spatiotemporal predictions for multiple policies over long horizons, 2) high-dimensional STOKs can be represented and computed efficiently in a factorized and reconfigurable form, and 3) STOKs record the probabilities of semantically interpretable goal-success and constraint-violation events, needed for formal verification. Given a high-dimensional state-transition model for an intractable planning problem, we can decompose it with local STOKs and goal-conditioned policies that are aggregated into a factorized goal kernel, making it possible to forward-plan at the level of goals in high-dimensions to solve the problem. These properties lead to highly flexible agents that can rapidly synthesize meta-policies, reuse planning representations across many tasks, and justify goals using empowerment, an intrinsic motivation function. We argue that reward-maximization is in conflict with the properties of compositionality, modularity, and interpretability. Alternatively, OKBEs facilitate these properties to support verifiable long-horizon planning and intrinsic motivation that scales to dynamic high-dimensional world-models.
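A toy sketch of the compositionality claim for discrete states: composing two transition kernels with the Chapman-Kolmogorov equation is a matrix product, and the result remains a valid kernel. The plain state-to-state matrices below are illustrative stand-ins for the paper's state-time option kernels (STOKs).

```python
import numpy as np

def compose(K_ab, K_bc):
    """Chapman-Kolmogorov: P(c | a) = sum_b P(b | a) P(c | b), i.e. a matrix product."""
    return K_ab @ K_bc

rng = np.random.default_rng(0)

def random_kernel(n):
    """A random row-stochastic matrix standing in for one option's transition kernel."""
    K = rng.random((n, n))
    return K / K.sum(axis=1, keepdims=True)

K1, K2 = random_kernel(5), random_kernel(5)
K12 = compose(K1, K2)                        # two options executed in sequence
assert np.allclose(K12.sum(axis=1), 1.0)     # the composition is still a valid kernel
```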
https://arxiv.org/abs/2506.09499
In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: First, they rely on pre-defined demonstrations or heuristic selecting strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; Second, individually selecting each demonstration fails in modeling the interactions between them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling the ability to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify the superior performance of our approach on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs.
https://arxiv.org/abs/2506.09473
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap" -- a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both the preferred and dispreferred responses to match the shorter one's length. When training with POET, both responses in each sample are truncated to equal length, yielding diverse truncated lengths across samples; the optimization of the DAA objective is thereby implicitly constrained to converge across all positions, so it pays more attention to prefix tokens than standard DAAs do. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving gains of up to 15.6 points on AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
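The truncation step itself is simple; a minimal sketch is given below, assuming responses are already tokenized into lists of token ids (names are illustrative).

```python
def poet_truncate(chosen_ids, rejected_ids):
    """Truncate both the preferred and dispreferred response to the shorter length."""
    cut = min(len(chosen_ids), len(rejected_ids))
    return chosen_ids[:cut], rejected_ids[:cut]

# Example: the pair below is truncated to 3 tokens each, so the DPO/SimPO loss
# is computed only over positions both responses share, which places relatively
# more weight on prefix tokens.
chosen, rejected = poet_truncate([11, 42, 7, 99, 5], [13, 8, 21])
assert chosen == [11, 42, 7] and rejected == [13, 8, 21]
```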
https://arxiv.org/abs/2506.09457
The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, our goal is to study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the $N$ agents' effort allocations on individual tasks to a task score, and an outer operator that merges the $M$ task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneous Environment Design (HED), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Experiments in matrix games and an embodied Multi-Goal-Capture environment show that, despite the difference in settings, HED rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HED and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.
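In the notation below (ours, not necessarily the paper's), the global reward composes an outer aggregator over the M task scores with an inner aggregator over the N agents' per-task efforts; the abstract's result is that the curvature of these two operators determines whether heterogeneous allocations can raise the reward, collapsing to a convexity test for broad reward families.

```latex
R(x) \;=\; \Phi\!\big(\, \phi(x_{1,1},\dots,x_{N,1}),\; \dots,\; \phi(x_{1,M},\dots,x_{N,M}) \,\big),
\qquad x_{i,m} = \text{effort of agent } i \text{ on task } m .
```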
https://arxiv.org/abs/2506.09434
The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on information-rich zones. It further introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO's superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon, at this https URL.
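A minimal sketch of a distance-shaped location reward, assuming the agent predicts a click point and the ground-truth target location (and optionally its bounding box) is known; the exponential decay and its scale are illustrative choices, not LPO's exact reward function.

```python
import math

def location_reward(pred_xy, target_xy, target_box=None, scale=100.0):
    """Reward that decays smoothly with the physical distance to the target."""
    dist = math.dist(pred_xy, target_xy)
    reward = math.exp(-dist / scale)
    if target_box is not None:
        x0, y0, x1, y1 = target_box
        inside = x0 <= pred_xy[0] <= x1 and y0 <= pred_xy[1] <= y1
        reward = max(reward, float(inside))   # full credit for clicks inside the element
    return reward
```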
https://arxiv.org/abs/2506.09373
Humanoid robots hold significant potential in accomplishing daily tasks across diverse environments thanks to their flexibility and human-like morphology. Recent works have made significant progress in humanoid whole-body control and loco-manipulation leveraging optimal control or reinforcement learning. However, these methods require tedious task-specific tuning for each task to achieve satisfactory behaviors, limiting their versatility and scalability to diverse tasks in daily scenarios. To that end, we introduce SkillBlender, a novel hierarchical reinforcement learning framework for versatile humanoid loco-manipulation. SkillBlender first pretrains goal-conditioned task-agnostic primitive skills, and then dynamically blends these skills to accomplish complex loco-manipulation tasks with minimal task-specific reward engineering. We also introduce SkillBench, a parallel, cross-embodiment, and diverse simulated benchmark containing three embodiments, four primitive skills, and eight challenging loco-manipulation tasks, accompanied by a set of scientific evaluation metrics balancing accuracy and feasibility. Extensive simulated experiments show that our method significantly outperforms all baselines, while naturally regularizing behaviors to avoid reward hacking, resulting in more accurate and feasible movements for diverse loco-manipulation tasks in our daily scenarios. Our code and benchmark will be open-sourced to the community to facilitate future research. Project page: this https URL.
https://arxiv.org/abs/2506.09366
Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of 18.4 and 4.1 points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by 15% while raising the number of effective optimization steps by 48% for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to 8. The repository can be accessed at this https URL.
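A minimal sketch of the replay idea, assuming a `rollout(prompt, n)` function that returns n fresh on-policy samples; the per-prompt buffer and uniform replay strategy below are illustrative choices, not RePO's exact replay strategies.

```python
import random
from collections import deque

class ReplayBuffer:
    """Per-prompt store of past generations that can be reused as off-policy samples."""
    def __init__(self, per_prompt_capacity=64):
        self.capacity = per_prompt_capacity
        self.data = {}                                   # prompt -> deque of samples

    def add(self, prompt, samples):
        self.data.setdefault(prompt, deque(maxlen=self.capacity)).extend(samples)

    def sample(self, prompt, k):
        pool = list(self.data.get(prompt, []))
        return random.sample(pool, min(k, len(pool)))

def build_group(prompt, rollout, buffer, n_on=8, n_off=8):
    """Mix fresh on-policy samples with replayed off-policy ones for one prompt."""
    on_policy = rollout(prompt, n_on)                    # expensive: new generations
    off_policy = buffer.sample(prompt, n_off)            # cheap: reused past generations
    buffer.add(prompt, on_policy)
    return on_policy + off_policy                        # group used for relative advantages
```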
https://arxiv.org/abs/2506.09340
Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding other's intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate the theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agent's ability to adapt and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.
https://arxiv.org/abs/2506.09331
This paper presents a state representation framework for Markov decision processes (MDPs) that can be learned solely from state trajectories, requiring neither reward signals nor the actions executed by the agent. We propose learning the minimum action distance (MAD), defined as the minimum number of actions required to transition between states, as a fundamental metric that captures the underlying structure of an environment. MAD naturally enables critical downstream tasks such as goal-conditioned reinforcement learning and reward shaping by providing a dense, geometrically meaningful measure of progress. Our self-supervised learning approach constructs an embedding space where the distances between embedded state pairs correspond to their MAD, accommodating both symmetric and asymmetric approximations. We evaluate the framework on a comprehensive suite of environments with known MAD values, encompassing both deterministic and stochastic dynamics, as well as discrete and continuous state spaces, and environments with noisy observations. Empirical results demonstrate that the proposed approach not only efficiently learns accurate MAD representations across these diverse settings but also significantly outperforms existing state representation methods in terms of representation quality.
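A minimal sketch of learning such an embedding from trajectories alone, treating the number of steps separating two states observed in the same trajectory as a (noisy, upper-bound) target for their embedded distance; the encoder and the simple regression loss are illustrative, not the paper's exact objective.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Maps raw states to an embedding space whose distances approximate MAD."""
    def __init__(self, state_dim, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, embed_dim))

    def forward(self, s):
        return self.net(s)

def mad_loss(encoder, s_i, s_j, step_gap):
    """Make embedded distances track the observed step gaps between state pairs."""
    d = (encoder(s_i) - encoder(s_j)).norm(dim=-1)
    return (d - step_gap).pow(2).mean()

# Example batch: pairs of 8-dimensional states and their step separations.
enc = StateEncoder(state_dim=8)
s_i, s_j = torch.randn(16, 8), torch.randn(16, 8)
gap = torch.randint(1, 10, (16,)).float()
loss = mad_loss(enc, s_i, s_j, gap)
loss.backward()
```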
https://arxiv.org/abs/2506.09276
We investigate the design of pooling methods used to summarize the outputs of transformer embedding models, primarily motivated by reinforcement learning and vision applications. This work considers problems where a subset of the input vectors contains requisite information for a downstream task (signal) while the rest are distractors (noise). By framing pooling as vector quantization with the goal of minimizing signal loss, we demonstrate that the standard methods used to aggregate transformer outputs, AvgPool, MaxPool, and ClsToken, are vulnerable to performance collapse as the signal-to-noise ratio (SNR) of inputs fluctuates. We then show that an attention-based adaptive pooling method can approximate the signal-optimal vector quantizer within derived error bounds for any SNR. Our theoretical results are first validated by supervised experiments on a synthetic dataset designed to isolate the SNR problem, then generalized to standard relational reasoning, multi-agent reinforcement learning, and vision benchmarks with noisy observations, where transformers with adaptive pooling display superior robustness across tasks.
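For comparison, the three standard pooling baselines and an attention-based adaptive pooling layer can be sketched in a few lines of PyTorch; the single-query attention module below is an illustrative variant, not necessarily the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """A learned query attends over token embeddings: a soft, adaptive pooling."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, tokens):                # tokens: (batch, seq, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled.squeeze(1)              # (batch, dim)

x = torch.randn(4, 16, 32)                    # 4 sequences of 16 transformer outputs
avg_pool = x.mean(dim=1)                      # AvgPool
max_pool = x.max(dim=1).values                # MaxPool
cls_pool = x[:, 0]                            # ClsToken (first token as the summary)
adaptive = AttentionPool(32)(x)               # attention-based adaptive pooling
```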
https://arxiv.org/abs/2506.09215