Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at this https URL .
https://arxiv.org/abs/2410.02743
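To make the macro-action idea concrete, here is a minimal sketch (not the authors' implementation) of computing advantages over fixed n-gram macro actions instead of individual tokens, assuming a single terminal reward from the reward model and a per-token critic; grouping shortens the effective horizon over which credit must propagate.

```python
# A minimal sketch of macro-action advantage estimation in RLHF.
# Assumptions: macro actions are simple fixed-length n-grams, the reward
# model gives one terminal scalar, and a critic provides per-token values.
import numpy as np

def macro_advantages(values, terminal_reward, n=5, gamma=1.0, lam=0.95):
    """GAE computed over macro-action (n-gram) value estimates.

    values: per-token value estimates, shape (T,)
    terminal_reward: scalar reward given at the end of the sequence
    n: number of tokens per macro action
    """
    macro_values = values[::n]          # value of a macro action = value at its first token
    M = len(macro_values)
    rewards = np.zeros(M)
    rewards[-1] = terminal_reward       # sparse reward on the last macro action

    adv = np.zeros(M)
    gae = 0.0
    for t in reversed(range(M)):
        next_v = macro_values[t + 1] if t + 1 < M else 0.0
        delta = rewards[t] + gamma * next_v - macro_values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    # Broadcast each macro-action advantage back to its member tokens so the
    # usual token-level PPO loss can be reused unchanged.
    return np.repeat(adv, n)[:len(values)]

if __name__ == "__main__":
    vals = np.random.randn(23) * 0.1
    print(macro_advantages(vals, terminal_reward=1.0, n=5).shape)  # (23,)
```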
Recent progress in generative models has stimulated significant innovations in many fields, such as image generation and chatbots. Despite their success, these models often produce sketchy and misleading solutions for complex multi-agent decision-making problems because they lack the trial-and-error experience and reasoning of humans. To address this limitation, we explore a paradigm that integrates a language-guided simulator into the multi-agent reinforcement learning pipeline to enhance the generated answer. The simulator is a world model that separately learns dynamics and reward, where the dynamics model comprises an image tokenizer as well as a causal transformer to generate interaction transitions autoregressively, and the reward model is a bidirectional transformer learned by maximizing the likelihood of trajectories in the expert demonstrations under language guidance. Given an image of the current state and the task description, we use the world model to train the joint policy and produce the image sequence as the answer by running the converged policy on the dynamics model. The empirical results demonstrate that this framework can improve the answers for multi-agent decision-making problems by showing superior performance on the training and unseen tasks of the StarCraft Multi-Agent Challenge benchmark. In particular, it can generate consistent interaction sequences and explainable reward functions at interaction states, opening the path for training generative models of the future.
https://arxiv.org/abs/2410.02664
The widely used expected utility theory has been shown to be empirically inconsistent with human preferences in the psychology and behavioral economics literature. Cumulative Prospect Theory (CPT) was developed to fill this gap and provide a better model for human-based decision-making, supported by empirical evidence. It allows a wide range of attitudes and perceptions towards risk, gains, and losses to be expressed. A few years ago, CPT was combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem in which the goal of the agent is to search for a policy generating long-term returns aligned with its preferences. In this work, we revisit this policy optimization problem and provide new insights on optimal policies and their nature depending on the utility function under consideration. We further derive a novel policy gradient theorem for the CPT policy optimization objective, generalizing the seminal corresponding result in standard RL. This result enables us to design a model-free policy gradient algorithm to solve the CPT-RL problem. We illustrate the performance of our algorithm on simple examples motivated by traffic control and electricity management applications. We also demonstrate that our policy gradient algorithm scales better to larger state spaces than the existing zeroth-order algorithm for solving the same problem.
https://arxiv.org/abs/2410.02605
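For concreteness, the following sketch computes the CPT objective of a sample of Monte Carlo returns using the standard Tversky-Kahneman utility and probability-weighting functions; the paper's exact choices of utility and weighting functions, and its policy gradient estimator, are not reproduced here.

```python
# A sketch of the CPT value of a set of sampled returns (reference point 0),
# i.e. the quantity a CPT-RL agent optimizes instead of the expected return.
import numpy as np

def w(p, eta=0.71):
    """Inverse-S-shaped Tversky-Kahneman probability weighting."""
    return p**eta / (p**eta + (1 - p)**eta) ** (1 / eta)

def cpt_value(returns, alpha=0.88, lam=2.25):
    """Estimate the CPT value of equally likely sampled returns."""
    x = np.sort(np.asarray(returns, dtype=float))
    n = len(x)
    u_plus = np.where(x > 0, np.abs(x) ** alpha, 0.0)          # utility of gains
    u_minus = np.where(x < 0, lam * np.abs(x) ** alpha, 0.0)   # loss-averse utility of losses
    # Decision weights from differences of weighted tail probabilities.
    tail_hi = np.arange(n, 0, -1) / n        # P(X >= x_(i))
    tail_lo = np.arange(1, n + 1) / n        # P(X <= x_(i))
    w_plus = w(tail_hi) - w(np.maximum(tail_hi - 1.0 / n, 0.0))
    w_minus = w(tail_lo) - w(np.maximum(tail_lo - 1.0 / n, 0.0))
    return float(np.sum(w_plus * u_plus) - np.sum(w_minus * u_minus))

if __name__ == "__main__":
    rets = np.random.default_rng(0).normal(size=1000)
    print(cpt_value(rets))
```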
Multi-Agent Reinforcement Learning (MARL) struggles with sample inefficiency and poor generalization [1]. These challenges are partially due to a lack of structure or inductive bias in the neural networks typically used to learn the policy. One such form of structure commonly observed in multi-agent scenarios is symmetry. The field of Geometric Deep Learning has developed Equivariant Graph Neural Networks (EGNNs) that are equivariant (or symmetric) to rotations, translations, and reflections of nodes. Incorporating equivariance has been shown to improve learning efficiency and decrease error [2]. In this paper, we demonstrate that EGNNs improve sample efficiency and generalization in MARL. However, we also show that a naive application of EGNNs to MARL results in poor early exploration due to a bias in the EGNN structure. To mitigate this bias, we present Exploration-enhanced Equivariant Graph Neural Networks, or E2GN2. We compare E2GN2 to other common function approximators on the common MARL benchmarks MPE and SMACv2. E2GN2 demonstrates a significant improvement in sample efficiency, greater final reward convergence, and a 2x-5x gain over standard GNNs in our generalization tests. These results pave the way for more reliable and effective solutions in complex multi-agent systems.
https://arxiv.org/abs/2410.02581
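For reference, here is a minimal NumPy sketch of one E(n)-equivariant message-passing layer in the style of EGNNs: messages depend only on invariant quantities (features and squared distances), and coordinate updates follow relative vectors. The exploration-bias fix that distinguishes E2GN2 is not reproduced here.

```python
# A minimal EGNN-style layer: rotation/translation-equivariant coordinate
# updates and invariant feature updates. Weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(d_in, d_hidden, d_out):
    W1 = rng.normal(scale=0.1, size=(d_in, d_hidden)); b1 = np.zeros(d_hidden)
    W2 = rng.normal(scale=0.1, size=(d_hidden, d_out)); b2 = np.zeros(d_out)
    return lambda v: np.maximum(v @ W1 + b1, 0.0) @ W2 + b2

F, D = 8, 2
phi_e = make_mlp(2 * F + 1, 16, F)   # edge/message network (invariant inputs only)
phi_x = make_mlp(F, 16, 1)           # scalar gate for coordinate updates
phi_h = make_mlp(2 * F, 16, F)       # node feature update network

def egnn_layer(h, x):
    """h: (N, F) invariant node features, x: (N, D) node coordinates."""
    N = h.shape[0]
    m_agg = np.zeros_like(h)
    dx = np.zeros_like(x)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            d2 = np.array([np.sum((x[i] - x[j]) ** 2)])        # invariant distance
            m_ij = phi_e(np.concatenate([h[i], h[j], d2]))
            m_agg[i] += m_ij
            # Updating along the relative vector keeps equivariance.
            dx[i] += (x[i] - x[j]) * phi_x(m_ij)
    h_new = h + np.stack([phi_h(np.concatenate([h[i], m_agg[i]])) for i in range(N)])
    return h_new, x + dx / (N - 1)

h, x = rng.normal(size=(5, F)), rng.normal(size=(5, D))
h2, x2 = egnn_layer(h, x)
print(h2.shape, x2.shape)   # (5, 8) (5, 2)
```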
With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although pursuing novelty, diversity, or uncertainty has attracted increasing attention, the redundant effort caused by exploration without proper guidance poses a practical issue for the community. This paper introduces a systematic approach, termed LEMAE, that channels informative, task-relevant guidance from a knowledgeable Large Language Model (LLM) for Efficient Multi-Agent Exploration. Specifically, we ground linguistic knowledge from the LLM into symbolic key states that are critical for task fulfillment, in a discriminative manner and at low LLM inference cost. To unleash the power of key states, we design the Subspace-based Hindsight Intrinsic Reward (SHIR) to guide agents toward key states by increasing reward density. Additionally, we build the Key State Memory Tree (KSMT) to track transitions between key states in a specific task for organized exploration. Benefiting from diminished redundant exploration, LEMAE outperforms existing SOTA approaches on challenging benchmarks (e.g., SMAC and MPE) by a large margin, achieving a 10x acceleration in certain scenarios.
https://arxiv.org/abs/2410.02511
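The sketch below illustrates the flavor of a subspace-based hindsight intrinsic reward: each key state constrains only a subset of state dimensions, and a trajectory earns a densified bonus as it reaches key states in order. The function name, matching rule, and reward scaling are my own assumptions, not LEMAE's exact formulation.

```python
# A hedged illustration of a subspace-based hindsight bonus over key states.
import numpy as np

def shir_bonus(trajectory, key_states, tol=1e-3, bonus=1.0):
    """trajectory: (T, d) states; key_states: ordered list of (dims, values) pairs."""
    T = len(trajectory)
    r_int = np.zeros(T)
    reached = 0
    for t, s in enumerate(trajectory):
        if reached < len(key_states):
            dims, vals = key_states[reached]
            if np.all(np.abs(s[dims] - vals) < tol):   # match in the key-state subspace
                # Scale so later key states are worth more (an assumed schedule).
                r_int[t] += bonus * (reached + 1) / len(key_states)
                reached += 1
    return r_int

traj = np.zeros((10, 4)); traj[3, :2] = [1.0, 2.0]; traj[7, 2:] = [3.0, 4.0]
keys = [(np.array([0, 1]), np.array([1.0, 2.0])),
        (np.array([2, 3]), np.array([3.0, 4.0]))]
print(shir_bonus(traj, keys))   # bonuses at the steps where key states are hit
```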
Dexterous hands exhibit significant potential for complex real-world grasping tasks. While recent studies have primarily focused on learning policies for specific robotic hands, the development of a universal policy that controls diverse dexterous hands remains largely unexplored. In this work, we study the learning of cross-embodiment dexterous grasping policies using reinforcement learning (RL). Inspired by the capability of human hands to control various dexterous hands through teleoperation, we propose a universal action space based on the human hand's eigengrasps. The policy outputs eigengrasp actions that are then converted into specific joint actions for each robot hand through a retargeting mapping. We simplify the robot hand's proprioception to include only the positions of fingertips and the palm, offering a unified observation space across different robot hands. Our approach demonstrates an 80% success rate in grasping objects from the YCB dataset across four distinct embodiments using a single vision-based policy. Additionally, our policy exhibits zero-shot generalization to two previously unseen embodiments and significant improvement in efficient finetuning. For further details and videos, visit our project page this https URL.
https://arxiv.org/abs/2410.02479
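A minimal sketch of the eigengrasp action space: the policy emits a few coefficients in a shared synergy basis, and a per-embodiment retargeting map converts them into that hand's joint targets. The basis, rest pose, and joint limits below are random placeholders, not the calibrated retargeting used in the paper.

```python
# One universal low-dimensional action drives hands with different joint counts.
import numpy as np

rng = np.random.default_rng(0)
N_EIGEN = 6                      # dimensionality of the shared eigengrasp action space

class EigengraspRetargeter:
    def __init__(self, n_joints):
        # Placeholder synergy-to-joint map; in practice this would come from
        # retargeting human-hand eigengrasps to the specific robot hand.
        self.basis = rng.normal(size=(n_joints, N_EIGEN))
        self.rest = np.zeros(n_joints)                 # rest (mean) pose
        self.lo, self.hi = -1.5, 1.5                   # joint limits (rad), assumed

    def __call__(self, eigen_action):
        q = self.rest + self.basis @ eigen_action      # joint position targets
        return np.clip(q, self.lo, self.hi)

hands = {"allegro": EigengraspRetargeter(16), "shadow": EigengraspRetargeter(22)}
a = rng.normal(size=N_EIGEN)          # one universal action from the policy
for name, retarget in hands.items():
    print(name, retarget(a).shape)    # same action, embodiment-specific joints
```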
Bimanual dexterous manipulation is a critical yet underexplored area in robotics. Its high-dimensional action space and inherent task complexity present significant challenges for policy learning, and the limited task diversity in existing benchmarks hinders general-purpose skill development. Existing approaches largely depend on reinforcement learning, often constrained by intricately designed reward functions tailored to a narrow set of tasks. In this work, we present a novel approach for efficiently learning diverse bimanual dexterous skills from abundant human demonstrations. Specifically, we introduce BiDexHD, a framework that unifies task construction from existing bimanual datasets and employs teacher-student policy learning to address all tasks. The teacher learns state-based policies using a general two-stage reward function across tasks with shared behaviors, while the student distills the learned multi-task policies into a vision-based policy. With BiDexHD, scalable learning of numerous bimanual dexterous skills from auto-constructed tasks becomes feasible, offering promising advances toward universal bimanual dexterous manipulation. Our empirical evaluation on the TACO dataset, spanning 141 tasks across six categories, demonstrates a task fulfillment rate of 74.59% on trained tasks and 51.07% on unseen tasks, showcasing the effectiveness and competitive zero-shot generalization capabilities of BiDexHD. For videos and more information, visit our project page this https URL.
https://arxiv.org/abs/2410.02477
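As a rough illustration of a general two-stage reward of the kind described above, the sketch below rewards approaching and grasping in stage one and tracking the demonstrated object pose in stage two; the specific terms, thresholds, and weights used in BiDexHD may differ.

```python
# A hedged two-stage reward sketch for bimanual manipulation.
import numpy as np

def two_stage_reward(left_ftips, right_ftips, obj_pos, obj_target, grasped):
    """Fingertip arrays are (k, 3) positions; obj_* are 3D positions; grasped: bool."""
    if not grasped:
        # Stage 1: bring both hands' fingertips close to the object.
        d = (np.linalg.norm(left_ftips - obj_pos, axis=1).mean()
             + np.linalg.norm(right_ftips - obj_pos, axis=1).mean())
        return float(np.exp(-2.0 * d))
    # Stage 2: once held, track the demonstrated object trajectory.
    return float(1.0 + np.exp(-5.0 * np.linalg.norm(obj_pos - obj_target)))

rng = np.random.default_rng(0)
print(two_stage_reward(rng.normal(size=(5, 3)), rng.normal(size=(5, 3)),
                       np.zeros(3), np.ones(3) * 0.1, grasped=True))
```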
Universal dexterous grasping across diverse objects presents a fundamental yet formidable challenge in robot learning. Existing approaches using reinforcement learning (RL) to develop policies on extensive object datasets face critical limitations, including complex curriculum design for multi-task learning and limited generalization to unseen objects. To overcome these challenges, we introduce ResDex, a novel approach that integrates residual policy learning with a mixture-of-experts (MoE) framework. ResDex is distinguished by its use of geometry-unaware base policies that are efficiently acquired on individual objects and capable of generalizing across a wide range of unseen objects. Our MoE framework incorporates several base policies to facilitate diverse grasping styles suitable for various objects. By learning residual actions alongside weights that combine these base policies, ResDex enables efficient multi-task RL for universal dexterous grasping. ResDex achieves state-of-the-art performance on the DexGraspNet dataset comprising 3,200 objects with an 88.8% success rate. It exhibits no generalization gap with unseen objects and demonstrates superior training efficiency, mastering all tasks within only 12 hours on a single GPU.
https://arxiv.org/abs/2410.02475
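A hedged sketch of the residual mixture-of-experts composition: the learned policy outputs mixture weights over frozen geometry-unaware base policies plus a residual correction added to the mixed action. The linear base policies here are stand-ins for the pretrained single-object policies.

```python
# Compose the executed action from weighted base policies plus a learned residual.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, N_EXPERTS = 32, 24, 4

base_policies = [rng.normal(scale=0.1, size=(ACT_DIM, OBS_DIM))
                 for _ in range(N_EXPERTS)]            # frozen base experts (placeholders)

def resdex_action(obs, weight_logits, residual):
    w = np.exp(weight_logits - weight_logits.max())
    w /= w.sum()                                       # softmax mixture weights
    base = sum(w[k] * (base_policies[k] @ obs) for k in range(N_EXPERTS))
    return base + residual                             # residual learned by multi-task RL

obs = rng.normal(size=OBS_DIM)
act = resdex_action(obs, rng.normal(size=N_EXPERTS), 0.05 * rng.normal(size=ACT_DIM))
print(act.shape)   # (24,)
```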
Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding horizon manner. To this end, we propose DOPPLER, a data-driven hierarchical framework that generates and updates plans based on instructions specified in linear temporal logic (LTL). Our method decomposes temporal tasks into a chain of options via hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion-generated options and leads to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that DOPPLER can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation. Demonstration videos are available online at: this https URL.
https://arxiv.org/abs/2410.02389
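The following sketch shows one standard way to realize determinantal-guided selection over a batch of candidates: greedy DPP MAP inference over an RBF kernel of candidate embeddings, which prefers mutually dissimilar candidates. DOPPLER applies the determinantal guidance inside diffusion posterior sampling, which is not reproduced here.

```python
# Greedy determinantal (DPP MAP) selection of a diverse subset of candidates.
import numpy as np

def greedy_dpp(features, k, gamma=1.0):
    """features: (N, d) candidate embeddings; returns indices of k diverse candidates."""
    X = np.asarray(features, dtype=float)
    K = np.exp(-gamma * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # RBF kernel
    selected = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            idx = selected + [i]
            # A larger determinant of the kernel submatrix means a more
            # mutually dissimilar (diverse) subset.
            det = np.linalg.det(K[np.ix_(idx, idx)])
            if det > best_det:
                best, best_det = i, det
        selected.append(best)
    return selected

cands = np.random.default_rng(0).normal(size=(20, 8))
print(greedy_dpp(cands, k=4))
```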
Dynamic and interactive traffic scenarios pose significant challenges for autonomous driving systems. Reinforcement learning (RL) offers a promising approach by enabling the exploration of driving policies beyond the constraints of pre-collected datasets and predefined conditions, particularly in complex environments. However, a critical challenge lies in effectively extracting spatial and temporal features from sequences of high-dimensional, multi-modal observations while minimizing the accumulation of errors over time. Additionally, efficiently guiding large-scale RL models to converge on optimal driving policies without frequent failures during the training process remains tricky. We propose an end-to-end model-based RL algorithm named Ramble to address these issues. Ramble processes multi-view RGB images and LiDAR point clouds into low-dimensional latent features to capture the context of traffic scenarios at each time step. A transformer-based architecture is then employed to model temporal dependencies and predict future states. By learning a dynamics model of the environment, Ramble can foresee upcoming traffic events and make more informed, strategic decisions. Our implementation demonstrates that prior experience in feature extraction and decision-making plays a pivotal role in accelerating the convergence of RL models toward optimal driving policies. Ramble achieves state-of-the-art performance regarding route completion rate and driving score on the CARLA Leaderboard 2.0, showcasing its effectiveness in managing complex and dynamic traffic situations.
https://arxiv.org/abs/2410.02253
Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.
https://arxiv.org/abs/2410.02229
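The preference-model pretraining objective is, at its core, a pairwise ranking loss over (chosen, rejected) code pairs. Below is a minimal sketch with a toy linear scorer standing in for the LLM-based reward model; the actual CodePMP training setup is not reproduced.

```python
# Bradley-Terry-style pairwise loss: maximize log-sigmoid of the score margin
# between the preferred and rejected sample in each synthesized pair.
import numpy as np

def pmp_loss(w, feats_chosen, feats_rejected):
    """Pairwise preference loss over a batch of (chosen, rejected) feature rows."""
    margin = feats_chosen @ w - feats_rejected @ w
    return float(np.mean(np.log1p(np.exp(-margin))))   # -log sigmoid(margin)

rng = np.random.default_rng(0)
d, n = 64, 128
w = rng.normal(scale=0.01, size=d)                     # toy scalar scoring head
chosen, rejected = rng.normal(size=(n, d)) + 0.1, rng.normal(size=(n, d))
print(pmp_loss(w, chosen, rejected))
```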
Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability of compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark with a margin of up to 5.6% and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following the language model post-training with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at this https URL.
https://arxiv.org/abs/2410.02197
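One simple way to see how embedding responses can capture intransitive preferences with linear query complexity is a skew-symmetric bilinear score between response embeddings, sketched below; the paper's exact parameterization of the preference representation may differ.

```python
# Each response is embedded once; the preference logit between two responses is
# an antisymmetric bilinear form, which (unlike a scalar BT reward) can encode
# cyclic preferences such as a > b > c > a.
import numpy as np

rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d))
A = A - A.T                        # skew-symmetric => score(a, b) = -score(b, a)

def pref_prob(va, vb):
    """P(a preferred over b) from the antisymmetric score va^T A vb."""
    return 1.0 / (1.0 + np.exp(-(va @ A @ vb)))

va, vb = rng.normal(size=(2, d))
print(pref_prob(va, vb) + pref_prob(vb, va))   # antisymmetry: always sums to 1.0
```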
Evaluating policies using off-policy data is crucial for applying reinforcement learning to real-world problems such as healthcare and autonomous driving. Previous methods for off-policy evaluation (OPE) generally suffer from high variance or irreducible bias, leading to unacceptably high prediction errors. In this work, we introduce STAR, a framework for OPE that encompasses a broad range of estimators -- which include existing OPE methods as special cases -- that achieve lower mean squared prediction errors. STAR leverages state abstraction to distill complex, potentially continuous problems into compact, discrete models which we call abstract reward processes (ARPs). Predictions from ARPs estimated from off-policy data are provably consistent (asymptotically correct). Rather than proposing a specific estimator, we present a new framework for OPE and empirically demonstrate that estimators within STAR outperform existing methods. The best STAR estimator outperforms baselines in all twelve cases studied, and even the median STAR estimator surpasses the baselines in seven out of the twelve cases.
https://arxiv.org/abs/2410.02172
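The sketch below illustrates the construction behind an abstract reward process: map states to a small set of abstract states, fit transition and reward statistics from importance-weighted off-policy trajectories, and solve the resulting discrete model for a value estimate. This is my own illustration of the idea, not the STAR estimators themselves.

```python
# Fit a toy abstract reward process (ARP) from off-policy data and read off a
# value estimate for the evaluation policy.
import numpy as np

def fit_arp(trajs, phi, n_abs, gamma=0.99):
    """trajs: list of [(s, a, r, s_next, rho)] with rho = pi(a|s)/mu(a|s)."""
    P = np.zeros((n_abs, n_abs)); R = np.zeros(n_abs)
    w = np.zeros(n_abs); d0 = np.zeros(n_abs)
    for traj in trajs:
        d0[phi(traj[0][0])] += 1.0               # abstract start distribution
        rho_prod = 1.0
        for (s, a, r, s_next, rho) in traj:
            rho_prod *= rho                      # cumulative importance weight
            z, z_next = phi(s), phi(s_next)
            P[z, z_next] += rho_prod
            R[z] += rho_prod * r
            w[z] += rho_prod
    d0 /= d0.sum()
    w = np.maximum(w, 1e-8)
    P /= w[:, None]; R /= w
    P /= np.maximum(P.sum(axis=1, keepdims=True), 1e-8)   # row-stochastic transitions
    v = np.linalg.solve(np.eye(n_abs) - gamma * P, R)      # value of the discrete ARP
    return float(d0 @ v)

phi = lambda s: int(s[0] > 0)            # toy abstraction: sign of the first feature
rng = np.random.default_rng(0)
trajs = [[(rng.normal(size=2), 0, rng.random(), rng.normal(size=2), 1.0)
          for _ in range(20)] for _ in range(10)]
print(fit_arp(trajs, phi, n_abs=2))
```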
Recent advancements in humanoid robotics, including the integration of hierarchical reinforcement learning-based control and the utilization of LLM planning, have significantly enhanced the ability of robots to perform complex tasks. In contrast to the highly developed humanoid robots, the human factors involved remain relatively unexplored. Directly controlling humanoid robots with the brain has already appeared in many science fiction works, such as Pacific Rim and Gundam. In this work, we present E2H (EEG-to-Humanoid), an innovative framework that pioneers the control of humanoid robots using high-frequency non-invasive neural signals. As non-invasive signal quality remains too low to decode precise spatial trajectories, we decompose the E2H framework into an innovative two-stage formulation: 1) decoding neural signals (EEG) into semantic motion keywords, and 2) utilizing LLM-facilitated motion generation with a precise motion-imitation control policy to realize humanoid robot control. The method of directly driving robots with brainwave commands offers a novel approach to human-machine collaboration, especially in situations where verbal commands are impractical, such as in cases of speech impairments, space exploration, or underwater exploration, unlocking significant potential. E2H offers an exciting glimpse into the future, holding immense potential for human-computer interaction.
https://arxiv.org/abs/2410.02141
Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
https://arxiv.org/abs/2410.02089
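A hedged sketch of the multi-turn loop that grounds generation in execution feedback: the policy proposes code, public tests run in a sandboxed subprocess, the resulting output is appended to the context, and the final pass/fail becomes the episode reward for RL. `generate` is a hypothetical stand-in for a call to the policy LLM.

```python
# Multi-turn code synthesis with execution feedback; reward = tests passed.
import os
import subprocess
import sys
import tempfile

def run_tests(code: str, tests: str, timeout=5):
    """Execute candidate code plus assert-based tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout)
        ok = proc.returncode == 0
        return ok, (proc.stderr or proc.stdout)[-500:]
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)

def episode(generate, prompt, tests, max_turns=3):
    """Roll out up to max_turns of generate -> execute -> feed back feedback."""
    history, reward = prompt, 0.0
    for _ in range(max_turns):
        code = generate(history)                 # hypothetical policy LLM call
        ok, feedback = run_tests(code, tests)
        if ok:
            reward = 1.0
            break
        history += f"\n# Execution feedback:\n# {feedback}\n"
    return reward, history
```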
Endovascular interventions are a life-saving treatment for many diseases, yet suffer from drawbacks such as radiation exposure and a potential scarcity of proficient physicians. Robotic assistance during these interventions could be a promising way to address these problems. Research on autonomous endovascular interventions utilizing artificial intelligence-based methodologies is gaining popularity. However, variability in assessment environments hinders the ability to compare and contrast the efficacy of different approaches, primarily because each study employs a unique evaluation framework. In this study, we present deep reinforcement learning-based autonomous endovascular device navigation on three distinct digital benchmark interventions: BasicWireNav, ArchVariety, and DualDeviceNav. The benchmark interventions were implemented with our modular simulation framework stEVE (simulated EndoVascular Environment). Autonomous controllers were trained solely in simulation and evaluated in simulation and on physical test benches with camera and fluoroscopy feedback. Autonomous control for BasicWireNav and ArchVariety reached high success rates and was successfully transferred from the simulated training environment to the physical test benches, while autonomous control for DualDeviceNav reached a moderate success rate. The experiments demonstrate the feasibility of stEVE and its potential for transferring controllers trained in simulation to real-world scenarios. Nevertheless, they also reveal areas that offer opportunities for future research. This study demonstrates the transferability of autonomous controllers from simulation to the real world in endovascular navigation, and it lowers entry barriers and increases the comparability of research on endovascular assistance systems by providing open-source training scripts, benchmarks, and the stEVE framework.
https://arxiv.org/abs/2410.01956
The use of deep neural networks in reinforcement learning (RL) often suffers from performance degradation as model size increases. While soft mixtures of experts (SoftMoEs) have recently shown promise in mitigating this issue for online RL, the reasons behind their effectiveness remain largely unknown. In this work we provide an in-depth analysis identifying the key factors driving this performance gain. We discover the surprising result that tokenizing the encoder output, rather than the use of multiple experts, is what is behind the efficacy of SoftMoEs. Indeed, we demonstrate that even with an appropriately scaled single expert, we are able to maintain the performance gains, largely thanks to tokenization.
https://arxiv.org/abs/2410.01930
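The key finding, tokenizing the encoder output, can be sketched in a few lines: keep the H x W spatial positions of the final convolutional feature map as separate tokens, apply a per-token network (a single, appropriately scaled expert suffices), and pool before the policy/value heads. The shapes and weights below are illustrative, not the paper's architecture.

```python
# Tokenize a conv feature map into per-position tokens instead of flattening
# it into one vector, then process each token with the same small network.
import numpy as np

rng = np.random.default_rng(0)

def tokenize(feature_map):
    """(C, H, W) conv features -> (H*W, C) tokens, one per spatial position."""
    C, H, W = feature_map.shape
    return feature_map.reshape(C, H * W).T

W1 = rng.normal(scale=0.1, size=(64, 128))   # per-token "single expert" MLP
W2 = rng.normal(scale=0.1, size=(128, 64))

def head(tokens):
    h = np.maximum(tokens @ W1, 0.0) @ W2     # applied independently to every token
    return h.mean(axis=0)                     # pool tokens for the policy/value head

fmap = rng.normal(size=(64, 7, 7))            # e.g. final conv layer of a pixel encoder
print(head(tokenize(fmap)).shape)             # (64,)
```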
One of the fundamental challenges in reinforcement learning (RL) is to take a complex task and decompose it into subtasks that are simpler for the RL agent to learn. In this paper, we report on our work on identifying subtasks using given positive and negative trajectories for solving the complex task. We assume that states are represented in first-order predicate logic, which we use to devise a novel algorithm for identifying the subtasks. We then employ a Large Language Model (LLM) to generate first-order logic rule templates for achieving each subtask. These rules are then further fine-tuned into a rule-based policy via an Inductive Logic Programming (ILP)-based RL agent. Through experiments, we verify the accuracy of our algorithm in detecting subtasks; it successfully detects all of the subtasks correctly. We also investigate the quality of the common-sense rules produced by the language model for achieving the subtasks. Our experiments show that our LLM-guided rule template generation can produce rules that are necessary for solving a subtask, which leads to solving complex tasks with fewer assumptions about predefined first-order logic predicates of the environment.
https://arxiv.org/abs/2410.01929
The growing interest in human-robot collaboration (HRC), where humans and robots cooperate towards shared goals, has seen significant advancements over the past decade. While previous research has addressed various challenges, several key issues remain unresolved. Many domains within HRC involve activities that do not necessarily require human presence throughout the entire task. Existing literature typically models HRC as a closed system, where all agents are present for the entire duration of the task. In contrast, an open model offers flexibility by allowing an agent to enter and exit the collaboration as needed, enabling them to concurrently manage other tasks. In this paper, we introduce a novel multiagent framework called oDec-MDP, designed specifically to model open HRC scenarios where agents can join or leave tasks flexibly during execution. We generalize a recent multiagent inverse reinforcement learning method, Dec-AIRL, to learn from open systems modeled using the oDec-MDP. Our method is validated through experiments conducted in both a simplified toy firefighting domain and a realistic dyadic human-robot collaborative assembly task. Results show that our framework and learning method improve upon their closed-system counterparts.
https://arxiv.org/abs/2410.01790
In this article, we investigate the alignment of Large Language Models according to human preferences. We discuss the features of training a Preference Model, which simulates human preferences, and the methods and details we found essential for achieving the best results. We also discuss using Reinforcement Learning to fine-tune Large Language Models and describe the challenges we faced and the ways to overcome them. Additionally, we present our experience with the Direct Preference Optimization method, which enables us to align a Large Language Model with human preferences without creating a separate Preference Model. As our contribution, we introduce the approach for collecting a preference dataset through perplexity filtering, which makes the process of creating such a dataset for a specific Language Model much easier and more cost-effective.
https://arxiv.org/abs/2410.01789
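A hedged sketch of perplexity filtering for preference-data collection: score each candidate response with the target model's token log-probabilities, convert to perplexity, and keep only pairs that fall under a threshold so the data stays close to the model's own distribution. `token_logprobs` is a hypothetical stand-in for a model scoring call, and the single-threshold rule is an assumption about the filtering criterion.

```python
# Filter (prompt, chosen, rejected) pairs by the target model's perplexity.
import math

def perplexity(logprobs):
    """logprobs: per-token log-probabilities of a response under the target model."""
    return math.exp(-sum(logprobs) / max(len(logprobs), 1))

def filter_pairs(pairs, token_logprobs, max_ppl=50.0):
    """Keep pairs whose chosen and rejected responses both score below max_ppl."""
    kept = []
    for prompt, chosen, rejected in pairs:
        if (perplexity(token_logprobs(prompt, chosen)) <= max_ppl and
                perplexity(token_logprobs(prompt, rejected)) <= max_ppl):
            kept.append((prompt, chosen, rejected))
    return kept
```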