Robotics policies are always subject to complex, second-order dynamics that entangle their actions with the resulting states. In reinforcement learning (RL) contexts, policies carry the burden of deciphering these complicated interactions over massive amounts of experience and complex reward functions to learn how to accomplish tasks. Moreover, policies typically issue actions directly to controllers like Operational Space Control (OSC) or joint PD control, which induces straight-line motion towards these action targets in task or joint space. However, straight-line motion in these spaces for the most part does not capture the rich, nonlinear behavior our robots need to exhibit, shifting the burden of discovering these behaviors more completely onto the agent. Unlike these simpler controllers, geometric fabrics capture a much richer and more desirable set of behaviors via artificial, second-order dynamics grounded in nonlinear geometry. These artificial dynamics shift the uncontrolled dynamics of a robot via an appropriate control law to form behavioral dynamics. Behavioral dynamics unlock a new action space and safe, guiding behavior over which RL policies are trained. Behavioral dynamics enable bang-bang-like RL policy actions that are still safe for real robots, simplify reward engineering, and help sequence real-world, high-performance policies. We describe the framework more generally and create a specific instantiation for the problem of dexterous, in-hand reorientation of a cube by a highly actuated robot hand.
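To make the new action space concrete, here is a minimal sketch assuming a toy repulsive-barrier "fabric" and invented parameter names; the paper's fabrics are constructed from nonlinear geometry and are far richer than this:

```python
import numpy as np

def fabric_accel(q, qd, obstacle, radius=0.1, k_damp=4.0):
    """Toy 'behavioral dynamics': a repulsive barrier around an obstacle
    plus damping. A stand-in for a real geometric fabric, which the paper
    grounds in nonlinear geometry."""
    d = q - obstacle
    dist = np.linalg.norm(d) + 1e-8
    repel = (radius / dist) ** 2 * (d / dist)   # accelerate away from the obstacle
    return repel - k_damp * qd

def step(q, qd, policy_action, dt=0.01):
    """The policy acts in the new action space by biasing the fabric's
    acceleration. Even bang-bang actions (clipped to [-1, 1]) produce
    smooth, bounded motion because the fabric shapes the dynamics."""
    a = np.clip(policy_action, -1.0, 1.0)
    qdd = fabric_accel(q, qd, obstacle=np.zeros_like(q)) + a
    qd = qd + dt * qdd
    q = q + dt * qd
    return q, qd
```

The point of the design is visible even in this sketch: the barrier and damping terms keep the state safe regardless of how extreme the policy's action is.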
https://arxiv.org/abs/2405.02250
Compact robotic platforms with powerful compute and actuation capabilities are key enablers for practical, real-world deployments of multi-agent research. This article introduces a tightly integrated hardware, control, and simulation software stack on a fleet of holonomic ground robot platforms designed with this motivation. Our robots, a fleet of customised DJI Robomaster S1 vehicles, offer a balance between small robots that do not possess sufficient compute or actuation capabilities and larger robots that are unsuitable for indoor multi-robot tests. They run a modular ROS2-based optimal estimation and control stack for full onboard autonomy, contain ad-hoc peer-to-peer communication infrastructure, and can zero-shot run multi-agent reinforcement learning (MARL) policies trained in our vectorized multi-agent simulation framework. We present an in-depth review of other platforms currently available, showcase new experimental validation of our system's capabilities, and introduce case studies that highlight the versatility and reliability of our system as a testbed for a wide range of research demonstrations. Our system as well as supplementary material is available online: this https URL
https://arxiv.org/abs/2405.02198
Agent-based models (ABMs) are simulation models used in economics to overcome some of the limitations of traditional frameworks based on general equilibrium assumptions. However, agents within an ABM follow predetermined, not fully rational, behavioural rules which can be cumbersome to design and difficult to justify. Here we leverage multi-agent reinforcement learning (RL) to expand the capabilities of ABMs with the introduction of fully rational agents that learn their policy by interacting with the environment and maximising a reward function. Specifically, we propose a 'Rational macro ABM' (R-MABM) framework by extending a paradigmatic macro ABM from the economic literature. We show that gradually substituting ABM firms in the model with RL agents, trained to maximise profits, allows for a thorough study of the impact of rationality on the economy. We find that RL agents spontaneously learn three distinct strategies for maximising profits, with the optimal strategy depending on the level of market competition and rationality. We also find that RL agents with independent policies, and without the ability to communicate with each other, spontaneously learn to segregate into different strategic groups, thus increasing market power and overall profits. Finally, we find that a higher degree of rationality in the economy always improves the macroeconomic environment as measured by total output; depending on the specific rational policy, this can come at the cost of higher instability. Our R-MABM framework is general: it allows for stable multi-agent learning and represents a principled and robust direction for extending existing economic simulators.
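A hedged sketch of how an RL "rational firm" might slot into such a macro ABM. The demand curve, cost structure, and observation space below are invented stand-ins, not the paper's model, which has far richer demand, labor, and credit dynamics:

```python
import numpy as np

class FirmEnv:
    """Toy stand-in for a single R-MABM firm: the agent sets a price and a
    production quantity, and the RL reward is realized profit."""

    def __init__(self, unit_cost=1.0, market_price=1.5, demand_scale=100.0):
        self.unit_cost = unit_cost
        self.market_price = market_price
        self.demand_scale = demand_scale

    def step(self, price, quantity):
        # Invented downward-sloping demand: sell less when priced above market.
        demand = self.demand_scale * max(0.0, 2.0 - price / self.market_price)
        sold = min(quantity, demand)
        profit = price * sold - self.unit_cost * quantity   # RL reward
        obs = np.array([self.market_price, demand, sold])
        return obs, profit
```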
https://arxiv.org/abs/2405.02161
Recent work has shown that deep neural networks are capable of approximating both value functions and policies in reinforcement learning domains featuring continuous state and action spaces. However, to the best of our knowledge, no previous work has succeeded at using deep neural networks in structured (parameterized) continuous action spaces. To fill this gap, this paper focuses on learning within the domain of simulated RoboCup soccer, which features a small set of discrete action types, each of which is parameterized with continuous variables. The best learned agent can score goals more reliably than the 2012 RoboCup champion agent. As such, this paper represents a successful extension of deep reinforcement learning to the class of parameterized action space MDPs.
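A parameterized action space is naturally handled by a two-headed policy. The layer sizes and HFO-style action split below are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ParamActionPolicy(nn.Module):
    """A discrete head chooses the action type (e.g. Dash/Turn/Tackle/Kick in
    simulated soccer) and a continuous head emits parameters for every type;
    only the chosen type's parameters are executed by the environment."""

    def __init__(self, obs_dim, n_types, param_dims):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.type_head = nn.Linear(128, n_types)
        self.param_head = nn.Linear(128, sum(param_dims))
        self.param_dims = list(param_dims)

    def forward(self, obs):
        h = self.body(obs)
        type_logits = self.type_head(h)
        params = torch.tanh(self.param_head(h))   # bounded continuous parameters
        return type_logits, torch.split(params, self.param_dims, dim=-1)

# Illustrative sizes: four action types with 2, 1, 1, and 2 parameters each.
# policy = ParamActionPolicy(obs_dim=58, n_types=4, param_dims=[2, 1, 1, 2])
```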
https://arxiv.org/abs/1511.04143
Robust Reinforcement Learning (RRL) is a promising Reinforcement Learning (RL) paradigm aimed at training models that are robust to uncertainty or disturbances, making them more efficient for real-world applications. Following this paradigm, uncertainty or disturbances are interpreted as the actions of a second, adversarial agent, and thus the problem reduces to seeking agent policies that are robust to any of the opponent's actions. This paper is the first to propose considering RRL problems within positional differential game theory, which helps us obtain theoretically justified intuition for developing a centralized Q-learning approach. Namely, we prove that under Isaacs's condition (sufficiently general for real-world dynamical systems), the same Q-function can be utilized as an approximate solution of both the minimax and maximin Bellman equations. Based on these results, we present the Isaacs Deep Q-Network algorithms and demonstrate their superiority over other baseline RRL and multi-agent RL algorithms in various environments.
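The shared-Q idea admits a compact sketch. The joint-action value layout below is an assumption about how such a network could be organized, not the paper's exact architecture:

```python
import torch

def shared_q_target(q_net, reward, next_obs, gamma=0.99):
    """Sketch: Q is defined over joint (agent u, adversary v) actions, so
    q_net(next_obs) is assumed to return values of shape [batch, n_u, n_v].
    Under Isaacs's condition the max-min and min-max targets coincide, so a
    single Q-function can serve both Bellman equations."""
    q_next = q_net(next_obs)                                # [batch, n_u, n_v]
    maximin = q_next.min(dim=2).values.max(dim=1).values    # max_u min_v Q
    return reward + gamma * maximin
```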
https://arxiv.org/abs/2405.02044
Designing protein nanomaterials of predefined shape and characteristics has the potential to dramatically impact the medical industry. Machine learning (ML) has proven successful in protein design, reducing the need for expensive wet lab experiment rounds. However, challenges persist in efficiently exploring the protein fitness landscapes to identify optimal protein designs. In response, we propose the use of AlphaZero to generate protein backbones, meeting shape and structural scoring requirements. We extend an existing Monte Carlo tree search (MCTS) framework by incorporating a novel threshold-based reward and secondary objectives to improve design precision. This innovation considerably outperforms existing approaches, leading to protein backbones that better respect structural scores. The application of AlphaZero is novel in the context of protein backbone design and demonstrates promising performance. AlphaZero consistently surpasses baseline MCTS by more than 100% in top-down protein design tasks. Additionally, our application of AlphaZero with secondary objectives uncovers further promising outcomes, indicating the potential of model-based reinforcement learning (RL) in navigating the intricate and nuanced aspects of protein design.
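One plausible form of a threshold-based reward with secondary objectives, written as an assumed sketch since the paper's exact functional form may differ:

```python
def threshold_reward(scores, thresholds, secondary=None, bonus=0.1):
    """Full credit only when every structural score clears its threshold;
    otherwise partial credit, and a small shaping bonus from a secondary
    objective once the thresholds are met. All constants are assumptions."""
    passed = all(s >= t for s, t in zip(scores, thresholds))
    if passed:
        r = 1.0
        if secondary is not None:
            r += bonus * secondary
    else:
        r = sum(min(s / t, 1.0) for s, t in zip(scores, thresholds)) / len(scores)
    return r
```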
https://arxiv.org/abs/2405.01983
The neural combinatorial optimization (NCO) approach has shown great potential for solving routing problems without the requirement of expert knowledge. However, existing constructive NCO methods cannot directly solve large-scale instances, which significantly limits their application prospects. To address these crucial shortcomings, this work proposes a novel Instance-Conditioned Adaptation Model (ICAM) for better large-scale generalization of neural combinatorial optimization. In particular, we design a powerful yet lightweight instance-conditioned adaptation module for the NCO model to generate better solutions for instances across different scales. In addition, we develop an efficient three-stage reinforcement learning-based training scheme that enables the model to learn cross-scale features without any labeled optimal solution. Experimental results show that our proposed method is capable of obtaining excellent results with a very fast inference time in solving Traveling Salesman Problems (TSPs) and Capacitated Vehicle Routing Problems (CVRPs) across different scales. To the best of our knowledge, our model achieves state-of-the-art performance among all RL-based constructive methods for TSP and CVRP with up to 1,000 nodes.
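A minimal sketch of what an instance-conditioned adaptation module could look like: a small network maps instance features (here, assumed to be the problem scale and pairwise distances) to an additive bias on the decoder's attention logits. The feature choice and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class InstanceConditionedBias(nn.Module):
    """Lightweight adaptation sketch: bias attention logits with a function
    of instance-level features so one model can construct solutions across
    very different problem scales."""
    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, attn_logits, scale, dists):
        # Per node pair: assumed features are problem scale and inter-node distance.
        feats = torch.stack([scale.expand_as(dists), dists], dim=-1)
        return attn_logits + self.mlp(feats).squeeze(-1)
```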
https://arxiv.org/abs/2405.01906
A recommender selects and presents top-K items to the user at each online request, and a recommendation session consists of several sequential requests. Formulating a recommendation session as a Markov decision process and solving it with a reinforcement learning (RL) framework has attracted increasing attention from both the academic and industry communities. In this paper, we propose an RL-based industrial short-video recommender ranking framework, which models and maximizes user watch time in an environment of multi-aspect user preferences via a collaborative multi-agent formulation. Moreover, our proposed framework adopts a model-based learning approach to alleviate sample selection bias, a crucial but intractable problem in industrial recommender systems. Extensive offline evaluations and live experiments confirm the effectiveness of our proposed method over alternatives. Our proposed approach has been deployed on our real large-scale short-video sharing platform, successfully serving hundreds of millions of users.
https://arxiv.org/abs/2405.01847
Multi-agent systems (MAS) need to adaptively cope with dynamic environments, changing agent populations, and diverse tasks. However, most multi-agent systems cannot easily handle these conditions, due to the complexity of the state and task spaces. Social impact theory regards the complex influencing factors as forces acting on an agent, emanating from the environment, other agents, and the agent's intrinsic motivation, referred to as social forces. Inspired by this concept, we propose a novel gradient-based state representation for multi-agent reinforcement learning. To non-trivially model the social forces, we further introduce a data-driven method in which we employ denoising score matching to learn the social gradient fields (SocialGFs) from offline samples, e.g., the attractive or repulsive outcomes of each force. During interactions, the agents take actions based on the multi-dimensional gradients to maximize their own rewards. In practice, we integrate SocialGFs into widely used multi-agent reinforcement learning algorithms, e.g., MAPPO. The empirical results reveal that SocialGFs offer four advantages for multi-agent systems: 1) they can be learned without requiring online interaction, 2) they demonstrate transferability across diverse tasks, 3) they facilitate credit assignment in challenging reward settings, and 4) they are scalable with the increasing number of agents.
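Denoising score matching itself is standard; here is a minimal sketch of learning a gradient field from offline samples, with the network interface assumed:

```python
import torch

def dsm_loss(score_net, x, sigma=0.1):
    """Denoising score matching: perturb offline samples x with Gaussian
    noise and train score_net to predict the score of the perturbation,
    -noise / sigma^2, which points back toward the data. The learned
    field is then queried as a gradient-based state representation."""
    noise = torch.randn_like(x) * sigma
    target = -noise / sigma ** 2
    pred = score_net(x + noise)
    return 0.5 * sigma ** 2 * ((pred - target) ** 2).sum(dim=-1).mean()
```

At deployment, the agents query `score_net` at their current state to obtain the multi-dimensional gradients that augment their observations.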
https://arxiv.org/abs/2405.01839
Autonomous wheeled-legged robots have the potential to transform logistics systems, improving operational efficiency and adaptability in urban environments. Navigating urban environments, however, poses unique challenges for robots, necessitating innovative solutions for locomotion and navigation. These challenges include the need for adaptive locomotion across varied terrains and the ability to navigate efficiently around complex dynamic obstacles. This work introduces a fully integrated system comprising adaptive locomotion control, mobility-aware local navigation planning, and large-scale path planning within the city. Using model-free reinforcement learning (RL) techniques and privileged learning, we develop a versatile locomotion controller. This controller achieves efficient and robust locomotion over various rough terrains, facilitated by smooth transitions between walking and driving modes. It is tightly integrated with a learned navigation controller through a hierarchical RL framework, enabling effective navigation through challenging terrain and various obstacles at high speed. Our controllers are integrated into a large-scale urban navigation system and validated by autonomous, kilometer-scale navigation missions conducted in Zurich, Switzerland, and Seville, Spain. These missions demonstrate the system's robustness and adaptability, underscoring the importance of integrated control systems in achieving seamless navigation in complex environments. Our findings support the feasibility of wheeled-legged robots and hierarchical RL for autonomous navigation, with implications for last-mile delivery and beyond.
https://arxiv.org/abs/2405.01792
In recent years, semi-supervised learning (SSL) has gained significant attention due to its ability to leverage both labeled and unlabeled data to improve model performance, especially when labeled data is scarce. However, most current SSL methods rely on heuristics or predefined rules for generating pseudo-labels and leveraging unlabeled data. They are limited to exploiting loss functions and regularization methods within the standard norm. In this paper, we propose a novel Reinforcement Learning (RL) Guided SSL method, RLGSSL, which formulates SSL as a one-armed bandit problem and deploys an innovative RL loss based on a weighted reward to adaptively guide the learning process of the prediction model. RLGSSL incorporates a carefully designed reward function that balances the use of labeled and unlabeled data to enhance generalization performance. A semi-supervised teacher-student framework is further deployed to increase learning stability. We demonstrate the effectiveness of RLGSSL through extensive experiments on several benchmark datasets and show that our approach achieves consistently superior performance compared to state-of-the-art SSL methods.
https://arxiv.org/abs/2405.01760
In the real world, the strong episode-resetting mechanisms that are needed to train agents in simulation are unavailable. This resetting assumption limits the potential of reinforcement learning in the real world, as providing resets to an agent usually requires the creation of additional handcrafted mechanisms or human interventions. Recent work aims to train agents (forward) with learned resets by constructing a second (backward) agent that returns the forward agent to the initial state. We find that the termination and timing of the transitions between these two agents are crucial for algorithm success. With this in mind, we create a new algorithm, Reset Free RL with Intelligently Switching Controller (RISC), which intelligently switches between the two agents based on the agent's confidence in achieving its current goal. Our new method achieves state-of-the-art performance on several challenging environments for reset-free RL.
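The switching rule can be sketched in a few lines; `success_prob` is a hypothetical learned confidence estimate, not necessarily the paper's exact quantity:

```python
def select_controller(obs, forward_agent, backward_agent, active, threshold=0.9):
    """Hand over control as soon as the active agent is confident of
    reaching its current goal: the forward agent yields so the backward
    agent can practice the reset, and vice versa."""
    agent = forward_agent if active == "forward" else backward_agent
    if agent.success_prob(obs) > threshold:
        active = "backward" if active == "forward" else "forward"
        agent = forward_agent if active == "forward" else backward_agent
    return agent, active
```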
https://arxiv.org/abs/2405.01684
Ensuring the safety of Reinforcement Learning (RL) is crucial for its deployment in real-world applications. Nevertheless, managing the trade-off between reward and safety during exploration presents a significant challenge. Improving reward performance through policy adjustments may adversely affect safety performance. In this study, we aim to address this conflicting relation by leveraging the theory of gradient manipulation. Initially, we analyze the conflict between reward and safety gradients. Subsequently, we tackle the balance between reward and safety optimization by proposing a soft switching policy optimization method, for which we provide convergence analysis. Based on our theoretical examination, we provide a safe RL framework to overcome the aforementioned challenge, and we develop a Safety-MuJoCo Benchmark to assess the performance of safe RL algorithms. Finally, we evaluate the effectiveness of our method on the Safety-MuJoCo Benchmark and a popular safe benchmark, Omnisafe. Experimental results demonstrate that our algorithms outperform several state-of-the-art baselines in terms of balancing reward and safety optimization.
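One standard instance of gradient manipulation is PCGrad-style conflict projection, sketched below for flattened policy gradients. The paper's soft-switching method is more elaborate, so treat this as background rather than their algorithm:

```python
import torch

def combine_gradients(g_reward, g_safety):
    """When the reward and safety gradients conflict (negative inner
    product), remove the reward component that opposes safety before
    summing the two update directions."""
    dot = torch.dot(g_reward, g_safety)
    if dot < 0:
        g_reward = g_reward - (dot / g_safety.norm() ** 2) * g_safety
    return g_reward + g_safety
```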
https://arxiv.org/abs/2405.01677
Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL achieves state-of-the-art results on over 25 challenging robotics tasks with up to 10 stages. PSL solves long-horizon tasks from raw visual input spanning four benchmarks at success rates of over 85%, out-performing language-based, classical, and end-to-end approaches. Video results and code at this https URL
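A high-level sketch of the Plan-Seq-Learn control flow; all interfaces here (`llm.plan`, `move_near`, `rl_policy.act`) are hypothetical names for the three modules the abstract describes:

```python
def plan_seq_learn(task, llm, motion_planner, rl_policy, env):
    """The LLM decomposes the task into subgoals, a motion planner moves the
    robot near each subgoal region, and a learned RL policy handles the
    contact-rich portion of each stage."""
    obs = env.reset()
    for subgoal in llm.plan(task):                    # abstract language plan
        obs = motion_planner.move_near(env, subgoal)  # bridge plan to the scene
        done = False
        while not done:                               # learned low-level control
            action = rl_policy.act(obs, subgoal)
            obs, reward, done, info = env.step(action)
    return obs
```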
https://arxiv.org/abs/2405.01534
Alignment is a standard procedure to fine-tune pre-trained large language models (LLMs) to follow natural language instructions and serve as helpful AI assistants. We have observed, however, that the conventional alignment process fails to enhance the factual accuracy of LLMs, and often leads to the generation of more false facts (i.e. hallucination). In this paper, we study how to make the LLM alignment process more factual, by first identifying factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL). In particular, we find that training the LLM on new knowledge or unfamiliar texts can encourage hallucination. This makes SFT less factual as it trains on human labeled data that may be novel to the LLM. Furthermore, reward functions used in standard RL can also encourage hallucination, because it guides the LLM to provide more helpful responses on a diverse set of instructions, often preferring longer and more detailed responses. Based on these observations, we propose factuality-aware alignment, comprised of factuality-aware SFT and factuality-aware RL through direct preference optimization. Experiments show that our proposed factuality-aware alignment guides LLMs to output more factual responses while maintaining instruction-following capability.
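The RL step runs through direct preference optimization; the DPO loss itself is standard, and the factuality awareness enters through how the preference pairs are constructed:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective: increase the policy's log-ratio for the
    preferred (here, more factual) response over the rejected one,
    measured relative to a frozen reference model."""
    logits = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()
```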
https://arxiv.org/abs/2405.01525
Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs which often contain tens or hundreds of billions of parameters. We create NeMo-Aligner, a toolkit for model alignment that can efficiently scale to using hundreds of GPUs for training. NeMo-Aligner comes with highly optimized and scalable implementations for major paradigms of model alignment such as: Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). Additionally, our toolkit supports running most of the alignment techniques in a Parameter Efficient Fine-Tuning (PEFT) setting. NeMo-Aligner is designed for extensibility, allowing support for other alignment techniques with minimal effort. It is open-sourced with Apache 2.0 License and we invite community contributions at this https URL
https://arxiv.org/abs/2405.01481
Despite substantial progress in machine learning for scientific discovery in recent years, truly de novo design of small molecules that exhibit a property of interest remains a significant challenge. We introduce LambdaZero, a generative active learning approach to search for synthesizable molecules. Powered by deep reinforcement learning, LambdaZero learns to search over the vast space of molecules to discover candidates with a desired property. We apply LambdaZero with molecular docking to design novel small molecules that inhibit the enzyme soluble Epoxide Hydrolase 2 (sEH), while enforcing constraints on synthesizability and drug-likeness. LambdaZero provides an exponential speedup in terms of the number of calls to the expensive molecular docking oracle, and LambdaZero de novo designed molecules reach docking scores that would otherwise require the virtual screening of a hundred billion molecules. Importantly, LambdaZero discovers novel scaffolds of synthesizable, drug-like inhibitors for sEH. In in vitro experimental validation, a series of ligands from a generated quinazoline-based scaffold were synthesized, and the lead inhibitor N-(4,6-di(pyrrolidin-1-yl)quinazolin-2-yl)-N-methylbenzamide (UM0152893) displayed sub-micromolar enzyme inhibition of sEH.
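A hedged sketch of a constrained docking reward of the kind described; the thresholds and the hard-constraint form are assumptions, not the paper's exact reward:

```python
def molecule_reward(docking_score, synthesizability, qed, synth_min=0.5, qed_min=0.4):
    """Assumed constrained reward: reward the (negated) docking score, since
    lower docking scores are better, but only when synthesizability and
    drug-likeness (QED) constraints hold."""
    if synthesizability < synth_min or qed < qed_min:
        return 0.0
    return max(0.0, -docking_score)
```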
https://arxiv.org/abs/2405.01616
Transesophageal echocardiography (TEE) plays a pivotal role in cardiology for diagnostic and interventional procedures. However, using it effectively requires extensive training due to the intricate nature of image acquisition and interpretation. To enhance the efficiency of novice sonographers and reduce variability in scan acquisitions, we propose a novel ultrasound (US) navigation assistance method based on contrastive learning as goal-conditioned reinforcement learning (GCRL). We augment the previous framework using a novel contrastive patient batching method (CPB) and a data-augmented contrastive loss, both of which we demonstrate are essential to ensure generalization to anatomical variations across patients. The proposed framework enables navigation to both standard diagnostic as well as intricate interventional views with a single model. Our method was developed with a large dataset of 789 patients and obtained an average error of 6.56 mm in position and 9.36 degrees in angle on a testing dataset of 140 patients, which is competitive or superior to models trained on individual views. Furthermore, we quantitatively validate our method's ability to navigate to interventional views such as the Left Atrial Appendage (LAA) view used in LAA closure. Our approach holds promise in providing valuable guidance during transesophageal ultrasound examinations, contributing to the advancement of skill acquisition for cardiac ultrasound practitioners.
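The contrastive ingredient can be sketched with a standard InfoNCE loss where, per the contrastive patient batching (CPB) idea, each batch slot holds a different patient; the exact loss details and interfaces are assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE over a batch whose elements come from different patients:
    in-batch negatives then correspond to other anatomies, encouraging
    features that generalize across anatomical variation."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                 # [B, B]; diagonal = positives
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```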
https://arxiv.org/abs/2405.01409
Controlling contact forces during interactions is critical for locomotion and manipulation tasks. While sim-to-real reinforcement learning (RL) has succeeded in many contact-rich problems, current RL methods achieve forceful interactions implicitly without explicitly regulating forces. We propose a method for training RL policies for direct force control without requiring access to force sensing. We showcase our method on a whole-body control platform of a quadruped robot with an arm. Such force control enables us to perform gravity compensation and impedance control, unlocking compliant whole-body manipulation. The learned whole-body controller with variable compliance makes it intuitive for humans to teleoperate the robot by only commanding the manipulator, and the robot's body adjusts automatically to achieve the desired position and force. Consequently, a human teleoperator can easily demonstrate a wide variety of loco-manipulation tasks. To the best of our knowledge, we provide the first deployment of learned whole-body force control in legged manipulators, paving the way for more versatile and adaptable legged robots.
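For reference, the classical impedance-control-with-gravity-compensation behavior that such a learned controller is trained to reproduce without force sensing; gains and interfaces below are illustrative, not the paper's values:

```python
import numpy as np

def impedance_torque(q, qd, q_des, f_des, jacobian, gravity, kp=40.0, kd=2.0):
    """Joint-space PD toward q_des gives compliance, the gravity term keeps
    the arm floating, and J^T maps a desired end-effector force into a
    feedforward joint torque."""
    tau_pd = kp * (q_des - q) - kd * qd   # compliant tracking
    tau_ff = jacobian.T @ f_des           # desired contact force
    return tau_pd + tau_ff + gravity
```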
https://arxiv.org/abs/2405.01402
The existing Motion Imitation models typically require expert data obtained through MoCap devices, but the vast amount of training data needed is difficult to acquire, necessitating substantial investments of financial resources, manpower, and time. This project combines 3D human pose estimation with reinforcement learning, proposing a novel model that simplifies Motion Imitation into a prediction problem of joint angle values in reinforcement learning. This significantly reduces the reliance on vast amounts of training data, enabling the agent to learn an imitation policy from just a few seconds of video and exhibit strong generalization capabilities. It can quickly apply the learned policy to imitate human arm motions in unfamiliar videos. The model first extracts skeletal motions of human arms from a given video using 3D human pose estimation. These extracted arm motions are then morphologically retargeted onto a robotic manipulator. Subsequently, the retargeted motions are used to generate reference motions. Finally, these reference motions are used to formulate a reinforcement learning problem, enabling the agent to learn a policy for imitating human arm motions. This project excels at imitation tasks and demonstrates robust transferability, accurately imitating human arm motions from other unfamiliar videos. This project provides a lightweight, convenient, efficient, and accurate Motion Imitation model. While simplifying the complex process of Motion Imitation, it achieves notably outstanding performance.
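Framing imitation as joint-angle prediction suggests a tracking reward of roughly this form; the exponential shape and scale are assumed, not taken from the paper:

```python
import numpy as np

def imitation_reward(q, q_ref, w=2.0):
    """Reward decays exponentially with the squared distance between the
    manipulator's joint angles q and the retargeted reference q_ref at the
    current timestep. The scale w is an assumed choice."""
    return float(np.exp(-w * np.sum((q - q_ref) ** 2)))
```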
https://arxiv.org/abs/2405.01284