The significance of network structures in promoting group cooperation within social dilemmas has been widely recognized. Prior studies attribute this facilitation to the assortment of strategies driven by spatial interactions. Although reinforcement learning has been employed to investigate the impact of dynamic interaction on the evolution of cooperation, we still lack an understanding of how agents develop neighbour-selection behaviours and of how strategic assortment forms within an explicit interaction structure. To address this, our study introduces a computational framework based on multi-agent reinforcement learning in the spatial Prisoner's Dilemma game. This framework allows agents to select dilemma strategies and interacting neighbours based on their long-term experiences, differing from existing research that relies on preset social norms or external incentives. By modelling each agent using two distinct Q-networks, we disentangle the coevolutionary dynamics between cooperation and interaction. The results indicate that long-term experience enables agents to develop the ability to identify non-cooperative neighbours and exhibit a preference for interaction with cooperative ones. This emergent self-organizing behaviour leads to the clustering of agents with similar strategies, thereby increasing network reciprocity and enhancing group cooperation.
https://arxiv.org/abs/2405.02654
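To make the two-network design concrete, here is a minimal tabular sketch (the ring topology, payoff values, and all names are illustrative assumptions, not the paper's setup): each agent keeps one Q-table for the dilemma strategy and a separate one for neighbour selection, both updated from the same payoff signal.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20, 4                      # agents on a ring, K candidate neighbours each
T, R, P, S = 5.0, 3.0, 1.0, 0.0   # Prisoner's Dilemma payoffs (T > R > P > S)
alpha, gamma, eps = 0.1, 0.9, 0.1

q_strategy = np.zeros((N, 2))     # per-agent Q-values over {cooperate=0, defect=1}
q_partner = np.zeros((N, K))      # per-agent Q-values over candidate neighbours

def payoff(a, b):
    return [[R, S], [T, P]][a][b]

for step in range(20000):
    i = rng.integers(N)
    # choose a neighbour and a dilemma action with separate epsilon-greedy policies
    j_idx = rng.integers(K) if rng.random() < eps else q_partner[i].argmax()
    j = (i + 1 + j_idx) % N       # candidate set: the K agents to the right
    a_i = rng.integers(2) if rng.random() < eps else q_strategy[i].argmax()
    a_j = rng.integers(2) if rng.random() < eps else q_strategy[j].argmax()
    r = payoff(a_i, a_j)
    # two decoupled Q-updates: what to play, and with whom to play
    q_strategy[i, a_i] += alpha * (r + gamma * q_strategy[i].max() - q_strategy[i, a_i])
    q_partner[i, j_idx] += alpha * (r + gamma * q_partner[i].max() - q_partner[i, j_idx])

print("cooperation rate:", (q_strategy.argmax(1) == 0).mean())
```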
Guidance commands of flight vehicles are a series of data sets with fixed time intervals; guidance design therefore constitutes a sequential decision problem and satisfies the basic conditions for using deep reinforcement learning (DRL). In this paper, we consider the scenario where the escape flight vehicle (EFV) generates guidance commands based on DRL and the pursuit flight vehicle (PFV) generates guidance commands based on the proportional navigation method. For the EFV, the objective of the guidance design is to progressively maximize the residual velocity, subject to the constraint imposed by the given evasion distance. This yields an irregular, extremely large-scale dynamic max-min problem, in which the time instant when the optimal solution can be attained is uncertain and the optimal solution depends on all the intermediate guidance commands generated beforehand. To solve this problem, a two-step strategy is conceived. In the first step, we use the proximal policy optimization (PPO) algorithm to generate the guidance commands of the EFV. The results obtained by PPO in the global search space are coarse, even though the reward function, the neural network parameters and the learning rate are carefully designed. Therefore, in the second step, we propose to invoke an evolution strategy (ES) based algorithm, which uses the result of PPO as the initial value, to further improve the quality of the solution by searching in the local space. Simulation results demonstrate that the proposed guidance design method based on the PPO algorithm is capable of achieving a residual velocity of 67.24 m/s, higher than the residual velocities achieved by the benchmark soft actor-critic and deep deterministic policy gradient algorithms. Furthermore, the proposed ES-enhanced PPO algorithm outperforms the PPO algorithm by 2.7%, achieving a residual velocity of 69.04 m/s.
https://arxiv.org/abs/2405.03711
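A sketch of the second step under toy assumptions: a rank-based evolution strategy (OpenAI-ES style, which may differ from the paper's exact ES variant) refines a parameter vector handed over from the PPO stage; episode_return is a stand-in for the actual guidance rollout.

```python
import numpy as np

rng = np.random.default_rng(1)

def episode_return(theta):
    # stand-in for a guidance rollout: returns residual velocity for policy theta
    return -np.sum((theta - 1.5) ** 2)   # toy objective with optimum at theta = 1.5

theta = np.zeros(8)                      # imagine this came from the PPO stage
sigma, lam, iters = 0.5, 32, 200         # ES noise scale, population size, iterations

for _ in range(iters):
    noise = rng.standard_normal((lam, theta.size))
    returns = np.array([episode_return(theta + sigma * n) for n in noise])
    ranks = (returns.argsort().argsort() / (lam - 1)) - 0.5   # rank-normalised fitness
    theta += (sigma / lam) * ranks @ noise                    # ES gradient estimate
    sigma *= 0.995                                            # shrink to a local search

print("refined objective:", episode_return(theta))
```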
Categorical Distributional Reinforcement Learning (CDRL) has demonstrated superior sample efficiency in learning complex tasks compared to conventional Reinforcement Learning (RL) approaches. However, the practical application of CDRL is encumbered by challenging projection steps, detailed parameter tuning, and the need for domain knowledge. This paper addresses these challenges by introducing a pioneering Continuous Distributional Model-Free RL algorithm tailored for continuous action spaces. The proposed algorithm simplifies the implementation of distributional RL, adopting an actor-critic architecture wherein the critic outputs a continuous probability distribution. Additionally, we propose an ensemble of multiple critics fused through a Kalman fusion mechanism to mitigate overestimation bias. Through a series of experiments, we validate that our proposed method is easy to train and serves as a sample-efficient solution for executing complex continuous-control tasks.
https://arxiv.org/abs/2405.02576
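In the scalar case, Kalman fusion of an ensemble of Gaussian critic estimates reduces to inverse-variance (precision) weighting. A minimal sketch of that combination rule, assuming independent critic estimates (the paper's mechanism may add more structure):

```python
import numpy as np

def kalman_fuse(means, variances):
    """Fuse independent Gaussian value estimates by inverse-variance weighting,
    the scalar Kalman / precision-weighted combination."""
    means, variances = np.asarray(means), np.asarray(variances)
    precision = 1.0 / variances
    fused_var = 1.0 / precision.sum()
    fused_mean = fused_var * (precision * means).sum()
    return fused_mean, fused_var

# three critics each output a Gaussian over the return of the same (s, a)
mu, var = kalman_fuse(means=[10.2, 11.0, 9.5], variances=[0.5, 1.5, 0.8])
print(f"fused estimate: {mu:.3f} +/- {var ** 0.5:.3f}")
```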
Policy-based methods have achieved remarkable success in solving challenging reinforcement learning problems. Among these methods, off-policy policy gradient methods are particularly important because they can benefit from off-policy data. However, these methods suffer from the high variance of the off-policy policy gradient (OPPG) estimator, which results in poor sample efficiency during training. In this paper, we propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue. Specifically, this baseline maintains the OPPG estimator's unbiasedness while theoretically minimizing its variance. To enhance practical computational efficiency, we design an approximated version of this optimal baseline. Utilizing this approximation, our method (Off-OAB) aims to decrease the OPPG estimator's variance during policy optimization. We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo, where it demonstrably surpasses state-of-the-art methods on the majority of these tasks.
https://arxiv.org/abs/2405.02572
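In notation assumed here (not copied from the paper), the estimator family the abstract describes looks as follows: an importance-weighted off-policy gradient with an action-dependent baseline, the unbiasedness condition the baseline must satisfy, and the optimal baseline chosen to minimize variance within that family.

```latex
% Notation assumed here: behaviour policy \beta, target policy \pi_\theta,
% stationary distribution d^\beta, importance ratio
% \rho(s, a) = \pi_\theta(a \mid s) / \beta(a \mid s).
\widehat{\nabla_\theta J}
  = \mathbb{E}_{s \sim d^{\beta},\, a \sim \beta}\!\left[
      \rho(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)
      \left( Q^{\pi}(s, a) - b(s, a) \right) \right].
% An action-dependent baseline b(s, a) leaves this unbiased whenever its
% correction term vanishes state-by-state:
\mathbb{E}_{a \sim \pi_\theta}\!\left[
    \nabla_\theta \log \pi_\theta(a \mid s)\, b(s, a) \right] = 0
  \quad \text{for all } s,
% and the paper's optimal baseline is the member of this family that
% minimises \operatorname{Var}\bigl(\widehat{\nabla_\theta J}\bigr).
```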
Curriculum design for reinforcement learning (RL) can speed up an agent's learning process and help it learn to perform well on complex tasks. However, existing techniques typically require domain-specific hyperparameter tuning, involve expensive optimization procedures for task selection, or are suitable only for specific learning objectives. In this work, we consider curriculum design in contextual multi-task settings where the agent's final performance is measured w.r.t. a target distribution over complex tasks. We base our curriculum design on the Zone of Proximal Development concept, which has proven effective in accelerating the learning process of RL agents for a uniform distribution over all tasks. We propose a novel curriculum, ProCuRL-Target, that effectively balances the need for selecting tasks that are not too difficult for the agent while progressing the agent's learning toward the target distribution by leveraging task correlations. We theoretically justify the task selection strategy of ProCuRL-Target by analyzing a simple learning setting with a REINFORCE learner model. Our experimental results across various domains with challenging target task distributions affirm the effectiveness of our curriculum strategy over state-of-the-art baselines in accelerating the training process of deep RL agents.
https://arxiv.org/abs/2405.02481
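A hypothetical scorer illustrating the Zone-of-Proximal-Development idea the curriculum builds on: prefer tasks the agent can almost solve (success probability near 0.5), weighted by their correlation with the target task distribution. All names and the scoring form are assumptions; ProCuRL-Target's actual selection rule is derived in the paper.

```python
import numpy as np

def select_task(succ_prob, corr_to_target, target_weights):
    """ZPD-style scorer: favour 'just right' tasks that also transfer
    toward the target distribution via task correlations."""
    zpd = succ_prob * (1.0 - succ_prob)            # peaks at p = 0.5
    relevance = corr_to_target @ target_weights    # alignment with target tasks
    return int(np.argmax(zpd * relevance))

succ_prob = np.array([0.95, 0.55, 0.10, 0.60])     # agent's current success rates
corr = np.array([[0.2, 0.1], [0.7, 0.4], [0.9, 0.8], [0.3, 0.9]])  # task-target corr.
target_w = np.array([0.5, 0.5])                    # target task distribution
print("next training task:", select_task(succ_prob, corr, target_w))
```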
We apply multi-agent deep reinforcement learning (RL) to train end-to-end robot soccer policies with fully onboard computation and sensing via egocentric RGB vision. This setting reflects many challenges of real-world robotics, including active perception, agile full-body control, and long-horizon planning in a dynamic, partially-observable, multi-agent domain. We rely on large-scale, simulation-based data generation to obtain complex behaviors from egocentric vision which can be successfully transferred to physical robots using low-cost sensors. To achieve adequate visual realism, our simulation combines rigid-body physics with learned, realistic rendering via multiple Neural Radiance Fields (NeRFs). We combine teacher-based multi-agent RL and cross-experiment data reuse to enable the discovery of sophisticated soccer strategies. We analyze active-perception behaviors including object tracking and ball seeking that emerge when simply optimizing perception-agnostic soccer play. The agents display levels of performance and agility equivalent to those of policies with access to privileged, ground-truth state. To our knowledge, this paper constitutes a first demonstration of end-to-end training for multi-agent robot soccer, mapping raw pixel observations to joint-level actions, that can be deployed in the real world. Videos of the game-play and analyses can be seen on our website this https URL.
https://arxiv.org/abs/2405.02425
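Teacher-based training of this kind is commonly realized as privileged distillation; a minimal PyTorch sketch under that assumption (network shapes, input sizes, and the MSE distillation loss are illustrative, not the paper's architecture): a teacher with ground-truth simulator state supervises a student that only sees egocentric pixels.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: the teacher sees privileged simulator state, the student
# only egocentric pixels; both output joint-level actions.
state_dim, act_dim = 32, 8
teacher = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

student = nn.Sequential(                           # egocentric RGB -> actions
    nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 9 * 9, act_dim),  # 48x48 input -> 9x9 feature map
)
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

for _ in range(100):                               # distillation on paired rollouts
    state = torch.randn(64, state_dim)             # privileged state (placeholder data)
    pixels = torch.randn(64, 3, 48, 48)            # rendered egocentric view of it
    with torch.no_grad():
        target = teacher(state)                    # teacher assumed pre-trained with RL
    loss = ((student(pixels) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```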
Robotics policies are always subject to complex, second-order dynamics that entangle their actions with resulting states. In reinforcement learning (RL) contexts, policies have the burden of deciphering these complicated interactions over massive amounts of experience and complex reward functions to learn how to accomplish tasks. Moreover, policies typically issue actions directly to controllers like Operational Space Control (OSC) or joint PD control, which induces straight-line motion towards these action targets in task or joint space. However, straight-line motion in these spaces for the most part does not capture the rich, nonlinear behavior our robots need to exhibit, shifting the burden of discovering these behaviors more completely to the agent. Unlike these simpler controllers, geometric fabrics capture a much richer and more desirable set of behaviors via artificial, second-order dynamics grounded in nonlinear geometry. These artificial dynamics shift the uncontrolled dynamics of a robot via an appropriate control law to form behavioral dynamics. Behavioral dynamics unlock a new action space and safe, guiding behavior over which RL policies are trained. Behavioral dynamics enable bang-bang-like RL policy actions that are still safe for real robots, simplify reward engineering, and help sequence real-world, high-performance policies. We describe the framework more generally and create a specific instantiation for the problem of dexterous, in-hand reorientation of a cube by a highly actuated robot hand.
https://arxiv.org/abs/2405.02250
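A toy stand-in for behavioral dynamics (a linear spring-damper rather than a true geometric fabric, so purely illustrative): the RL action sets an attractor target, and the second-order dynamics turn even bang-bang action sequences into smooth, bounded motion.

```python
import numpy as np

def behavioral_step(x, xd, action, dt=0.01, k=50.0, c=12.0):
    """Toy second-order 'behavioral dynamics': the policy action sets an
    attractor target, and damped dynamics shape the resulting motion, so
    even bang-bang actions yield smooth, bounded trajectories."""
    xdd = k * (action - x) - c * xd          # spring-damper toward the action target
    xd = xd + dt * xdd
    x = x + dt * xd
    return x, xd

x, xd = np.zeros(2), np.zeros(2)
for t in range(300):
    action = np.array([1.0, -1.0]) if (t // 50) % 2 == 0 else np.array([-1.0, 1.0])
    x, xd = behavioral_step(x, xd, action)   # bang-bang target, smooth state
print("final state:", np.round(x, 3), "velocity:", np.round(xd, 3))
```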
Compact robotic platforms with powerful compute and actuation capabilities are key enablers for practical, real-world deployments of multi-agent research. This article introduces a tightly integrated hardware, control, and simulation software stack on a fleet of holonomic ground robot platforms designed with this motivation. Our robots, a fleet of customised DJI Robomaster S1 vehicles, offer a balance between small robots that do not possess sufficient compute or actuation capabilities and larger robots that are unsuitable for indoor multi-robot tests. They run a modular ROS2-based optimal estimation and control stack for full onboard autonomy, contain ad-hoc peer-to-peer communication infrastructure, and can zero-shot run multi-agent reinforcement learning (MARL) policies trained in our vectorized multi-agent simulation framework. We present an in-depth review of other platforms currently available, showcase new experimental validation of our system's capabilities, and introduce case studies that highlight the versatility and reliability of our system as a testbed for a wide range of research demonstrations. Our system as well as supplementary material is available online: this https URL
https://arxiv.org/abs/2405.02198
Agent-based models (ABMs) are simulation models used in economics to overcome some of the limitations of traditional frameworks based on general equilibrium assumptions. However, agents within an ABM follow predetermined, not fully rational, behavioural rules which can be cumbersome to design and difficult to justify. Here we leverage multi-agent reinforcement learning (RL) to expand the capabilities of ABMs with the introduction of fully rational agents that learn their policy by interacting with the environment and maximising a reward function. Specifically, we propose a 'Rational macro ABM' (R-MABM) framework by extending a paradigmatic macro ABM from the economic literature. We show that gradually substituting ABM firms in the model with RL agents, trained to maximise profits, allows for a thorough study of the impact of rationality on the economy. We find that RL agents spontaneously learn three distinct strategies for maximising profits, with the optimal strategy depending on the level of market competition and rationality. We also find that RL agents with independent policies, and without the ability to communicate with each other, spontaneously learn to segregate into different strategic groups, thus increasing market power and overall profits. Finally, we find that a higher degree of rationality in the economy always improves the macroeconomic environment as measured by total output; depending on the specific rational policy, this can come at the cost of higher instability. Our R-MABM framework is general, allows for stable multi-agent learning, and represents a principled and robust direction to extend existing economic simulators.
https://arxiv.org/abs/2405.02161
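A minimal sketch of the substitution idea with a toy linear demand curve (the profit function and simple bandit learner are assumptions; the paper's macro ABM and RL agents are far richer): a firm agent learns its profit-maximising markup purely from reward feedback.

```python
import numpy as np

rng = np.random.default_rng(2)
prices = np.linspace(0.8, 1.6, 9)      # candidate markups over unit cost
q = np.zeros(len(prices))              # bandit value estimate per markup
counts = np.zeros(len(prices))

def profit(markup, competitor_price=1.2, cost=1.0):
    # toy demand: falls with own markup, rises if the competitor is pricier
    demand = max(0.0, 2.0 - markup + 0.5 * (competitor_price - markup))
    return (markup - cost) * demand + rng.normal(0, 0.02)   # noisy market

for t in range(5000):
    a = rng.integers(len(prices)) if rng.random() < 0.1 else q.argmax()
    r = profit(prices[a])
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]     # incremental mean update

print("learned markup:", prices[q.argmax()])
```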
Recent work has shown that deep neural networks are capable of approximating both value functions and policies in reinforcement learning domains featuring continuous state and action spaces. However, to the best of our knowledge no previous work has succeeded in using deep neural networks in structured (parameterized) continuous action spaces. To fill this gap, this paper focuses on learning within the domain of simulated RoboCup soccer, which features a small set of discrete action types, each of which is parameterized with continuous variables. The best learned agent can score goals more reliably than the 2012 RoboCup champion agent. As such, this paper represents a successful extension of deep reinforcement learning to the class of parameterized action space MDPs.
https://arxiv.org/abs/1511.04143
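A sketch of what a parameterized action space looks like at the policy head, with hypothetical weights and soccer-style action types as an example: a softmax picks the discrete type, and a per-type continuous head supplies its parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
ACTION_TYPES = ["dash", "turn", "kick"]          # discrete types, soccer-style
PARAM_DIMS = {"dash": 2, "turn": 1, "kick": 2}   # each type takes continuous params

def sample_parameterized_action(obs, W_type, W_param):
    """Hypothetical head layout: one softmax over types plus one continuous
    parameter vector per type; the executed action is (type, its params)."""
    logits = W_type @ obs
    p = np.exp(logits - logits.max())
    p /= p.sum()
    k = rng.choice(len(ACTION_TYPES), p=p)
    name = ACTION_TYPES[k]
    params = np.tanh(W_param[name] @ obs)        # bounded continuous parameters
    return name, params

obs = rng.standard_normal(10)
W_type = rng.standard_normal((3, 10)) * 0.1
W_param = {n: rng.standard_normal((d, 10)) * 0.1 for n, d in PARAM_DIMS.items()}
print(sample_parameterized_action(obs, W_type, W_param))
```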
Robust Reinforcement Learning (RRL) is a promising Reinforcement Learning (RL) paradigm aimed at training models that are robust to uncertainty or disturbances, making them more efficient for real-world applications. Following this paradigm, uncertainty or disturbances are interpreted as the actions of a second, adversarial agent, and thus the problem is reduced to seeking agent policies that are robust to any opponent's actions. This paper is the first to propose considering RRL problems within positional differential game theory, which helps us to obtain theoretically justified intuition to develop a centralized Q-learning approach. Namely, we prove that under Isaacs's condition (sufficiently general for real-world dynamical systems), the same Q-function can be utilized as an approximate solution of both the minimax and maximin Bellman equations. Based on these results, we present the Isaacs Deep Q-Network algorithms and demonstrate their superiority compared to other baseline RRL and Multi-Agent RL algorithms in various environments.
https://arxiv.org/abs/2405.02044
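In assumed notation (not taken verbatim from the paper), the key identity reads as follows: Isaacs's condition makes the min-max and max-min Hamiltonians coincide, so a single Q-function can approximately satisfy both Bellman fixed points.

```latex
% Notation assumed here: state x, costate p, control u (agent),
% disturbance v (adversary), dynamics f(x, u, v).
% Isaacs's condition: the Hamiltonian saddle interchange holds pointwise,
\min_{u} \max_{v} \, \langle p, f(x, u, v) \rangle
  = \max_{v} \min_{u} \, \langle p, f(x, u, v) \rangle
  \qquad \forall \, x, p.
% Under it, one Q-function can approximately satisfy both Bellman fixed
% points, so a single centralised network serves both players:
Q(s, u, v) = r(s, u, v) + \gamma \min_{u'} \max_{v'} Q(s', u', v')
           = r(s, u, v) + \gamma \max_{v'} \min_{u'} Q(s', u', v').
```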
Designing protein nanomaterials of predefined shape and characteristics has the potential to dramatically impact the medical industry. Machine learning (ML) has proven successful in protein design, reducing the need for expensive rounds of wet-lab experiments. However, challenges persist in efficiently exploring the protein fitness landscapes to identify optimal protein designs. In response, we propose the use of AlphaZero to generate protein backbones, meeting shape and structural scoring requirements. We extend an existing Monte Carlo tree search (MCTS) framework by incorporating a novel threshold-based reward and secondary objectives to improve design precision. This innovation considerably outperforms existing approaches, leading to protein backbones that better respect structural scores. The application of AlphaZero is novel in the context of protein backbone design and demonstrates promising performance. AlphaZero consistently surpasses baseline MCTS by more than 100% in top-down protein design tasks. Additionally, our application of AlphaZero with secondary objectives uncovers further promising outcomes, indicating the potential of model-based reinforcement learning (RL) in navigating the intricate and nuanced aspects of protein design.
https://arxiv.org/abs/2405.01983
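A hypothetical illustration of a threshold-based reward combined with secondary objectives (the thresholds, pass/fail structure, and bonus shaping below are assumptions, not the paper's exact reward): a design earns the main reward only once every structural score clears its threshold, after which secondary objectives add a shaped bonus.

```python
def threshold_reward(scores, thresholds, secondary=None, bonus_scale=0.1):
    """Hypothetical threshold-based reward: credit is gated on all structural
    scores clearing their thresholds; secondary objectives then add a small
    shaped bonus on top of the binary pass signal."""
    passed = all(s >= t for s, t in zip(scores, thresholds))
    reward = 1.0 if passed else 0.0
    if passed and secondary is not None:
        reward += bonus_scale * secondary   # e.g. shape-matching margin
    return reward

# a backbone whose structural scores all clear their thresholds
print(threshold_reward(scores=[0.82, 0.91], thresholds=[0.8, 0.9], secondary=0.4))
```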
The neural combinatorial optimization (NCO) approach has shown great potential for solving routing problems without the requirement of expert knowledge. However, existing constructive NCO methods cannot directly solve large-scale instances, which significantly limits their application prospects. To address these crucial shortcomings, this work proposes a novel Instance-Conditioned Adaptation Model (ICAM) for better large-scale generalization of neural combinatorial optimization. In particular, we design a powerful yet lightweight instance-conditioned adaptation module for the NCO model to generate better solutions for instances across different scales. In addition, we develop an efficient three-stage reinforcement learning-based training scheme that enables the model to learn cross-scale features without any labeled optimal solution. Experimental results show that our proposed method is capable of obtaining excellent results with a very fast inference time in solving Traveling Salesman Problems (TSPs) and Capacitated Vehicle Routing Problems (CVRPs) across different scales. To the best of our knowledge, our model achieves state-of-the-art performance among all RL-based constructive methods for TSP and CVRP with up to 1,000 nodes.
https://arxiv.org/abs/2405.01906
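One way to picture an instance-conditioned adaptation module, as a loose sketch (the bias form and parameters below are assumptions, not ICAM's actual module): the decoder's node-selection logits are biased by a term that depends on the instance scale and pairwise distances, letting one model decode instances of very different sizes.

```python
import numpy as np

def adapted_logits(compat, dist, n_nodes, alpha=-1.0, beta=0.05):
    """Hypothetical instance-conditioned adaptation: bias the decoder's
    node-selection logits with a term depending on the instance scale
    (n_nodes) and pairwise distances, so one model spans many sizes."""
    scale = np.log(n_nodes)                 # instance-level conditioning signal
    return compat + alpha * dist * (1.0 + beta * scale)

rng = np.random.default_rng(4)
n = 100
compat = rng.standard_normal(n)            # raw attention compatibilities
dist = rng.random(n)                       # distances from the current node
probs = np.exp(adapted_logits(compat, dist, n))
probs /= probs.sum()                       # next-city distribution for TSP decoding
print("most likely next node:", probs.argmax())
```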
A recommender selects and presents the top-K items to the user at each online request, and a recommendation session consists of several sequential requests. Formulating a recommendation session as a Markov decision process and solving it within a reinforcement learning (RL) framework has attracted increasing attention from both the academic and industry communities. In this paper, we propose an RL-based industrial short-video recommender ranking framework, which models and maximizes user watch time in an environment of multi-aspect user preferences via a collaborative multi-agent formulation. Moreover, our proposed framework adopts a model-based learning approach to alleviate sample selection bias, a crucial but intractable problem in industrial recommender systems. Extensive offline evaluations and live experiments confirm the effectiveness of our proposed method over alternatives. Our proposed approach has been deployed in our real large-scale short-video sharing platform, successfully serving hundreds of millions of users.
https://arxiv.org/abs/2405.01847
Multi-agent systems (MAS) need to adaptively cope with dynamic environments, changing agent populations, and diverse tasks. However, most multi-agent systems cannot easily handle them, due to the complexity of the state and task space. Social impact theory regards the complex influencing factors as forces acting on an agent, emanating from the environment, other agents, and the agent's intrinsic motivation, referred to as social forces. Inspired by this concept, we propose a novel gradient-based state representation for multi-agent reinforcement learning. To non-trivially model the social forces, we further introduce a data-driven method, where we employ denoising score matching to learn the social gradient fields (SocialGFs) from offline samples, e.g., the attractive or repulsive outcomes of each force. During interactions, the agents take actions based on the multi-dimensional gradients to maximize their own rewards. In practice, we integrate SocialGFs into the widely used multi-agent reinforcement learning algorithms, e.g., MAPPO. The empirical results reveal that SocialGFs offer four advantages for multi-agent systems: 1) they can be learned without requiring online interaction, 2) they demonstrate transferability across diverse tasks, 3) they facilitate credit assignment in challenging reward settings, and 4) they are scalable with the increasing number of agents.
https://arxiv.org/abs/2405.01839
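A self-contained sketch of the denoising-score-matching step with a linear score model on 2-D toy samples (the linear model and Gaussian data are simplifying assumptions; the paper presumably uses neural score networks): the learned gradient field points agents toward the "attractive outcome" region.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma_noise = np.array([2.0, -1.0]), 0.3
data = mu + 0.5 * rng.standard_normal((4000, 2))   # offline 'attractive outcome' samples

# Linear score model s(x) = A x + b, trained with denoising score matching:
# minimise E || s(x + sigma*eps) + eps / sigma ||^2 over noisy samples.
A, b = np.zeros((2, 2)), np.zeros(2)
lr = 1e-2
for _ in range(5000):
    x = data[rng.integers(len(data), size=128)]
    eps = rng.standard_normal(x.shape)
    xt = x + sigma_noise * eps
    pred = xt @ A.T + b
    err = pred + eps / sigma_noise                 # DSM residual
    A -= lr * (err.T @ xt) / len(x)
    b -= lr * err.mean(0)

agent_pos = np.array([0.0, 0.0])
print("gradient-field action at origin:", agent_pos @ A.T + b)  # roughly points toward mu
```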
Autonomous wheeled-legged robots have the potential to transform logistics systems, improving operational efficiency and adaptability in urban environments. Navigating urban environments, however, poses unique challenges for robots, necessitating innovative solutions for locomotion and navigation. These challenges include the need for adaptive locomotion across varied terrains and the ability to navigate efficiently around complex dynamic obstacles. This work introduces a fully integrated system comprising adaptive locomotion control, mobility-aware local navigation planning, and large-scale path planning within the city. Using model-free reinforcement learning (RL) techniques and privileged learning, we develop a versatile locomotion controller. This controller achieves efficient and robust locomotion over various rough terrains, facilitated by smooth transitions between walking and driving modes. It is tightly integrated with a learned navigation controller through a hierarchical RL framework, enabling effective navigation through challenging terrain and various obstacles at high speed. Our controllers are integrated into a large-scale urban navigation system and validated by autonomous, kilometer-scale navigation missions conducted in Zurich, Switzerland, and Seville, Spain. These missions demonstrate the system's robustness and adaptability, underscoring the importance of integrated control systems in achieving seamless navigation in complex environments. Our findings support the feasibility of wheeled-legged robots and hierarchical RL for autonomous navigation, with implications for last-mile delivery and beyond.
https://arxiv.org/abs/2405.01792
In recent years, semi-supervised learning (SSL) has gained significant attention due to its ability to leverage both labeled and unlabeled data to improve model performance, especially when labeled data is scarce. However, most current SSL methods rely on heuristics or predefined rules for generating pseudo-labels and leveraging unlabeled data. They are limited to exploiting loss functions and regularization methods within the standard norm. In this paper, we propose a novel Reinforcement Learning (RL) Guided SSL method, RLGSSL, that formulates SSL as a one-armed bandit problem and deploys an innovative RL loss based on weighted reward to adaptively guide the learning process of the prediction model. RLGSSL incorporates a carefully designed reward function that balances the use of labeled and unlabeled data to enhance generalization performance. A semi-supervised teacher-student framework is further deployed to increase the learning stability. We demonstrate the effectiveness of RLGSSL through extensive experiments on several benchmark datasets and show that our approach achieves consistent superior performance compared to state-of-the-art SSL methods.
https://arxiv.org/abs/2405.01760
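A hypothetical weighted-reward form conveying the balancing idea (the blend rule below is an assumption, not RLGSSL's actual reward): supervised accuracy and an unlabeled-consistency proxy are mixed according to the data balance, so scarce labels push the reward toward the unlabeled term.

```python
def weighted_reward(acc_labeled, consistency_unlabeled, n_labeled, n_unlabeled):
    """Hypothetical weighted reward for RL-guided SSL: blend supervised
    accuracy with an unlabeled-consistency proxy, weighted by how much
    data each side contributes."""
    w = n_labeled / (n_labeled + n_unlabeled)
    return w * acc_labeled + (1.0 - w) * consistency_unlabeled

# scarce labels: the reward leans on unlabeled consistency as a generalisation proxy
print(weighted_reward(acc_labeled=0.92, consistency_unlabeled=0.75,
                      n_labeled=100, n_unlabeled=5000))
```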
In the real world, the strong episode resetting mechanisms that are needed to train agents in simulation are unavailable. The resetting assumption limits the potential of reinforcement learning in the real world, as providing resets to an agent usually requires the creation of additional handcrafted mechanisms or human interventions. Recent work aims to train agents (forward) with learned resets by constructing a second (backward) agent that returns the forward agent to the initial state. We find that the termination and timing of the transitions between these two agents are crucial for algorithm success. With this in mind, we create a new algorithm, Reset Free RL with Intelligently Switching Controller (RISC), which intelligently switches between the two agents based on the agent's confidence in achieving its current goal. Our new method achieves state-of-the-art performance on several challenging environments for reset-free RL.
https://arxiv.org/abs/2405.01684
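A minimal sketch of the switching rule (the threshold form and names are assumptions; RISC derives its switch from the agent's learned confidence): control passes to the other agent as soon as the active one is confident of reaching its current goal, rather than after a fixed timeout.

```python
def choose_controller(success_prob, threshold=0.9, active="forward"):
    """Hypothetical RISC-style switch: keep running the active agent while it
    is unsure of reaching its current goal; once it is confident of success,
    hand control to the other agent instead of waiting for a timeout."""
    if success_prob >= threshold:
        return "backward" if active == "forward" else "forward"
    return active

active = "forward"
for success_prob in [0.2, 0.5, 0.93, 0.4, 0.95]:   # e.g. from a learned value head
    active = choose_controller(success_prob, active=active)
    print(f"p(success)={success_prob:.2f} -> running {active} agent")
```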
Ensuring the safety of Reinforcement Learning (RL) is crucial for its deployment in real-world applications. Nevertheless, managing the trade-off between reward and safety during exploration presents a significant challenge. Improving reward performance through policy adjustments may adversely affect safety performance. In this study, we aim to address this conflicting relation by leveraging the theory of gradient manipulation. Initially, we analyze the conflict between reward and safety gradients. Subsequently, we tackle the balance between reward and safety optimization by proposing a soft switching policy optimization method, for which we provide convergence analysis. Based on our theoretical examination, we provide a safe RL framework to overcome the aforementioned challenge, and we develop a Safety-MuJoCo Benchmark to assess the performance of safe RL algorithms. Finally, we evaluate the effectiveness of our method on the Safety-MuJoCo Benchmark and a popular safe benchmark, Omnisafe. Experimental results demonstrate that our algorithms outperform several state-of-the-art baselines in terms of balancing reward and safety optimization.
https://arxiv.org/abs/2405.01677
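A PCGrad-style projection is one concrete instance of gradient manipulation (an assumption here; the paper's soft-switching rule differs): when the reward and safety gradients conflict, the reward gradient is projected onto the normal plane of the safety gradient before the update is combined.

```python
import numpy as np

def manipulate_gradients(g_reward, g_safety):
    """Gradient-manipulation sketch (PCGrad-style projection, an assumption,
    not the paper's exact rule): when reward and safety gradients conflict
    (negative inner product), project the reward gradient onto the normal
    plane of the safety gradient, then combine the two."""
    conflict = g_reward @ g_safety
    if conflict < 0:
        g_reward = g_reward - (conflict / (g_safety @ g_safety)) * g_safety
    return g_reward + g_safety

g_r = np.array([1.0, 0.0])
g_s = np.array([-0.6, 0.8])          # conflicting: g_r . g_s < 0
print("combined update:", manipulate_gradients(g_r, g_s))
```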
Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL achieves state-of-the-art results on over 25 challenging robotics tasks with up to 10 stages. PSL solves long-horizon tasks from raw visual input spanning four benchmarks at success rates of over 85%, outperforming language-based, classical, and end-to-end approaches. Video results and code at this https URL
https://arxiv.org/abs/2405.01534
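A stubbed sketch of the modular loop (every callable below is a placeholder standing in for PSL's components, not its API): an LLM decomposes the task into target regions, a motion planner reaches each region, and a learned low-level policy takes over for the contact-rich segment.

```python
def llm(prompt):
    # stub standing in for an LLM plan: ordered target regions for the task
    return ["above_drawer_handle", "inside_drawer", "above_cube"]

def motion_planner(env, goal_region):
    env.state = f"at:{goal_region}"        # stub: collision-free reach of the region
    return env.state

class ToyEnv:
    def reset(self):
        self.state, self.steps = "start", 0
        return self.state
    def step(self, action):
        self.steps += 1
        return self.state, self.steps % 3 == 0   # stub contact-rich interaction

def rl_policy(obs, region):
    return f"local-control({region})"      # stub learned low-level policy

# PSL-style modular loop: LLM plan -> motion-plan to region -> local RL control
env = ToyEnv()
obs = env.reset()
for region in llm("open the drawer and pick up the cube"):
    obs = motion_planner(env, goal_region=region)
    for _ in range(10):
        obs, done = env.step(rl_policy(obs, region))
        if done:
            break
print("finished at:", obs)
```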