Many real-world decision-making tasks, such as safety-critical scenarios, cannot be fully described in a single-objective setting using the Markov Decision Process (MDP) framework, as they include hard constraints. These can instead be modeled with additional cost functions within the Constrained Markov Decision Process (CMDP) framework. Even though CMDPs have been extensively studied in the Reinforcement Learning literature, little attention has been given to sampling-based planning algorithms such as MCTS for solving them. Previous approaches use Monte Carlo cost estimates to avoid constraint violations. However, these suffer from high variance, which results in conservative performance with respect to costs. We propose Constrained MCTS (C-MCTS), an algorithm that estimates cost using a safety critic. The safety critic is trained with Temporal Difference learning in an offline phase prior to agent deployment. This critic limits the exploration of the search tree and removes unsafe trajectories within MCTS during deployment. C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards compared to previous work. As a nice byproduct, the planner is more efficient, requiring fewer planning steps. Most importantly, we show that under model mismatch between the planner and the real world, our approach is less susceptible to cost violations than previous work.
https://arxiv.org/abs/2305.16209
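The abstract above gives no pseudocode; the following is a minimal sketch of the mechanism it describes, namely a safety critic trained offline with temporal-difference learning on cost signals and then used at deployment to prune tree branches whose estimated cumulative cost would exceed the budget. The tabular critic, the one-step model interface, and the cost-budget check are illustrative assumptions, not the paper's exact implementation.

    from collections import defaultdict

    GAMMA = 0.99

    def td_update(critic, s, cost, s_next, lr=0.1):
        """Offline phase: TD(0) update of a tabular safety critic that
        estimates the expected discounted cumulative cost from a state."""
        target = cost + GAMMA * critic[s_next]
        critic[s] += lr * (target - critic[s])

    def safe_actions(state, actions, model, critic, cost_budget):
        """Deployment phase: during tree expansion, keep only actions whose
        one-step cost plus the critic's estimate for the successor state
        stays within the remaining cost budget."""
        safe = []
        for a in actions:
            s_next, cost = model(state, a)      # generative planning model
            if cost + GAMMA * critic[s_next] <= cost_budget:
                safe.append(a)
        return safe

    critic = defaultdict(float)                 # tabular critic for illustration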
We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with different unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL converges to the true safe set with no policy regret. We evaluate CoCoRL in tabular environments and a continuous driving simulation with multiple constraints. CoCoRL learns constraints that lead to safe driving behavior and that can be transferred to different tasks and environments. In contrast, alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit poor performance and learn unsafe policies.
https://arxiv.org/abs/2305.16147
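One way to picture the convex safe set described above: in a CMDP with linear constraints, each safe demonstration induces a feature-expectation vector, and any policy whose feature expectations lie in the convex hull of those vectors satisfies every constraint that all demonstrations satisfy. The membership test below (a small feasibility LP) only illustrates that geometric idea; it is not the full CoCoRL algorithm, which also handles unknown rewards and near-optimal demonstrations.

    import numpy as np
    from scipy.optimize import linprog

    def in_convex_hull(point, demo_features):
        """Check whether `point` lies in the convex hull of the rows of
        `demo_features` by solving a small feasibility LP."""
        n, d = demo_features.shape
        # Find weights w >= 0 with sum(w) = 1 and w @ demo_features = point.
        A_eq = np.vstack([demo_features.T, np.ones(n)])
        b_eq = np.append(point, 1.0)
        res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * n, method="highs")
        return res.success

    # A candidate policy whose feature expectations fall inside the hull is
    # certified safe under any linear constraint satisfied by all demos.
    demos = np.array([[0.2, 0.1], [0.4, 0.3], [0.1, 0.5]])
    print(in_convex_hull(np.array([0.25, 0.3]), demos))   # True
    print(in_convex_hull(np.array([0.9, 0.9]), demos))    # False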
Most reinforcement learning algorithms treat the context under which they operate as a stationary, isolated and undisturbed environment. However, in the real world, the environment is constantly changing due to a variety of external influences. To address this problem, we study Markov Decision Processes (MDP) under the influence of an external temporal process. We formalize this notion and discuss conditions under which the problem becomes tractable with suitable solutions. We propose a policy iteration algorithm to solve this problem and theoretically analyze its performance.
https://arxiv.org/abs/2305.16056
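The abstract does not detail the proposed policy iteration algorithm. A hedged illustration of one natural construction, which is only an assumption here, is to fold the state of the external temporal process into the MDP state as an exogenous component and run standard policy iteration on the augmented model:

    import numpy as np

    # Hypothetical illustration: fold the external temporal process into the
    # state as an exogenous component z, then run ordinary policy iteration on
    # the augmented state (s, z).  P[a] is a transition matrix over augmented
    # states, R is a reward vector over augmented states.
    def policy_iteration(P, R, gamma=0.95, iters=100):
        n_actions, n_states, _ = P.shape
        pi = np.zeros(n_states, dtype=int)
        for _ in range(iters):
            # Policy evaluation: solve (I - gamma * P_pi) V = R
            P_pi = P[pi, np.arange(n_states)]
            V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
            # Policy improvement
            Q = R[None, :] + gamma * P @ V          # shape (A, S)
            new_pi = Q.argmax(axis=0)
            if np.array_equal(new_pi, pi):
                break
            pi = new_pi
        return pi, V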
Biological nervous systems consist of networks of diverse, sophisticated information processors in the form of neurons of different classes. In most artificial neural networks (ANNs), neural computation is abstracted to an activation function that is usually shared between all neurons within a layer or even the whole network; training of ANNs focuses on synaptic optimization. In this paper, we propose the optimization of neuro-centric parameters to attain a set of diverse neurons that can perform complex computations. Demonstrating the promise of the approach, we show that evolving neural parameters alone allows agents to solve various reinforcement learning tasks without optimizing any synaptic weights. While not aiming to be an accurate biological model, parameterizing neurons to a larger degree than the current common practice allows us to ask questions about the computational abilities afforded by neural diversity in random neural networks. The presented results open up interesting future research directions, such as combining evolved neural diversity with activity-dependent plasticity.
https://arxiv.org/abs/2305.15945
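A toy sketch of the setting described above: synaptic weights are fixed and random, each neuron carries its own activation parameters (here a gain, bias, and slope, which is an illustrative choice rather than the paper's parameterization), and only those neuron-centric parameters are evolved. A toy fitness function stands in for the reinforcement learning tasks used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Fixed random synaptic weights that are never trained.
    W1, W2 = rng.normal(size=(8, 2)), rng.normal(size=(1, 8))

    def forward(x, neuron_params):
        # Each hidden neuron has its own (gain, bias, slope) activation parameters.
        gain, bias, slope = neuron_params.T
        h = np.tanh(slope * (W1 @ x)) * gain + bias
        return W2 @ h

    def fitness(neuron_params):
        # Toy task: approximate XOR with fixed weights, tuning only the neurons.
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
        y = np.array([0, 1, 1, 0], float)
        preds = np.array([forward(x, neuron_params)[0] for x in X])
        return -np.mean((preds - y) ** 2)

    # Simple (1+lambda) evolution strategy over neuron-centric parameters only.
    params = rng.normal(size=(8, 3))
    for gen in range(200):
        candidates = [params + 0.1 * rng.normal(size=params.shape) for _ in range(16)]
        best = max(candidates, key=fitness)
        if fitness(best) > fitness(params):
            params = best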
A successful tactic that is followed by the scientific community for advancing AI is to treat games as problems, which has been proven to lead to various breakthroughs. We adapt this strategy in order to study Rocket League, a widely popular but rather under-explored 3D multiplayer video game with a distinct physics engine and complex dynamics that pose a significant challenge in developing efficient and high-performance game-playing agents. In this paper, we present Lucy-SKG, a Reinforcement Learning-based model that learned how to play Rocket League in a sample-efficient manner, outperforming by a notable margin the two highest-ranking bots in this game, namely Necto (2022 bot champion) and its successor Nexto, thus becoming a state-of-the-art agent. Our contributions include: a) the development of a reward analysis and visualization library, b) novel parameterizable reward shape functions that capture the utility of complex reward types via our proposed Kinesthetic Reward Combination (KRC) technique, and c) design of auxiliary neural architectures for training on reward prediction and state representation tasks in an on-policy fashion for enhanced efficiency in learning speed and performance. By performing thorough ablation studies for each component of Lucy-SKG, we showed their independent effectiveness in overall performance. In doing so, we demonstrate the prospects and challenges of using sample-efficient Reinforcement Learning techniques for controlling complex dynamical systems under competitive team-based multiplayer conditions.
https://arxiv.org/abs/2305.15801
While distributional reinforcement learning (RL) has demonstrated empirical success, the question of when and why it is beneficial has remained unanswered. In this work, we provide one explanation for the benefits of distributional RL through the lens of small-loss bounds, which scale with the instance-dependent optimal cost. If the optimal cost is small, our bounds are stronger than those from non-distributional approaches. As warmup, we show that learning the cost distribution leads to small-loss regret bounds in contextual bandits (CB), and we find that distributional CB empirically outperforms the state-of-the-art on three challenging tasks. For online RL, we propose a distributional version-space algorithm that constructs confidence sets using maximum likelihood estimation, and we prove that it achieves small-loss regret in the tabular MDPs and enjoys small-loss PAC bounds in latent variable models. Building on similar insights, we propose a distributional offline RL algorithm based on the pessimism principle and prove that it enjoys small-loss PAC bounds, which exhibit a novel robustness property. For both online and offline RL, our results provide the first theoretical benefits of learning distributions even when we only need the mean for making decisions.
https://arxiv.org/abs/2305.15703
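As a toy caricature of the central object above (a maximum-likelihood estimate of the cost distribution that is used only through its mean), consider a non-contextual bandit with a discretized categorical model per arm. This is purely illustrative and is not the paper's confidence-set or pessimistic algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    support = np.linspace(0.0, 1.0, 11)          # discretized cost values

    class DistributionalArm:
        """Maximum-likelihood categorical estimate of an arm's cost
        distribution; decisions only ever use its mean."""
        def __init__(self):
            self.counts = np.ones_like(support)  # Laplace smoothing

        def update(self, cost):
            self.counts[np.abs(support - cost).argmin()] += 1

        def mean(self):
            p = self.counts / self.counts.sum()
            return float(p @ support)

    arms = [DistributionalArm() for _ in range(3)]
    true_means = [0.1, 0.4, 0.7]
    for t in range(2000):
        if rng.random() < 0.05:                  # small amount of exploration
            a = int(rng.integers(len(arms)))
        else:
            a = int(np.argmin([arm.mean() for arm in arms]))
        cost = float(np.clip(rng.normal(true_means[a], 0.1), 0.0, 1.0))
        arms[a].update(cost)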
Offline-to-online reinforcement learning (RL), by combining the benefits of offline pretraining and online finetuning, promises enhanced sample efficiency and policy performance. However, existing methods, effective as they are, suffer from suboptimal performance, limited adaptability, and unsatisfactory computational efficiency. We propose a novel framework, PROTO, which overcomes the aforementioned limitations by augmenting the standard RL objective with an iteratively evolving regularization term. Performing a trust-region-style update, PROTO yields stable initial finetuning and optimal final performance by gradually evolving the regularization term to relax the constraint strength. By adjusting only a few lines of code, PROTO can bridge any offline policy pretraining and standard off-policy RL finetuning to form a powerful offline-to-online RL pathway, birthing great adaptability to diverse methods. Simple yet elegant, PROTO imposes minimal additional computation and enables highly efficient online finetuning. Extensive experiments demonstrate that PROTO achieves superior performance over SOTA baselines, offering an adaptable and efficient offline-to-online RL framework.
https://arxiv.org/abs/2305.15669
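The abstract describes augmenting the standard RL objective with an iteratively evolving regularization term whose strength is gradually relaxed. The exact form is not given there; the sketch below assumes a KL penalty toward a reference policy (e.g., the previous iterate of the pretrained policy) with an annealed coefficient, which is one plausible instantiation rather than PROTO's definitive objective.

    import torch
    import torch.nn.functional as F

    def proto_style_actor_loss(logits, ref_logits, actions, advantages, beta):
        """Standard policy-gradient term plus an evolving KL regularizer toward
        a reference policy (e.g., the previous iterate of the pretrained
        policy).  `beta` is annealed toward 0 to relax the constraint."""
        logp = F.log_softmax(logits, dim=-1)
        ref_logp = F.log_softmax(ref_logits, dim=-1).detach()
        pg = -(advantages * logp.gather(1, actions.unsqueeze(1)).squeeze(1)).mean()
        kl = (ref_logp.exp() * (ref_logp - logp)).sum(dim=-1).mean()
        return pg + beta * kl

    def beta_schedule(step, total_steps, beta0=1.0):
        """Anneal the regularization strength, e.g. linearly, over finetuning."""
        return beta0 * max(0.0, 1.0 - step / total_steps)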
Reinforcement learning (RL) is an area of significant research interest, and safe RL in particular is attracting attention due to its ability to handle safety-driven constraints that are crucial for real-world applications. This work proposes a novel approach to RL training, called control invariant set (CIS) enhanced RL, which leverages the advantages of utilizing the explicit form of CIS to improve stability guarantees and sampling efficiency. Furthermore, the robustness of the proposed approach is investigated in the presence of uncertainty. The approach consists of two learning stages: offline and online. In the offline stage, the CIS is incorporated into the reward design, initial state sampling, and state reset procedures, which improves sampling efficiency during offline training. In the online stage, a Safety Supervisor examines the safety of each proposed action and makes necessary corrections, and RL is retrained whenever the predicted next-step state falls outside the CIS, which serves as a stability criterion. The stability analysis is conducted for both cases, with and without uncertainty. To evaluate the proposed approach, we apply it to a simulated chemical reactor. The results show a significant improvement in sampling efficiency during offline training and a closed-loop stability guarantee in the online implementation, with and without uncertainty.
https://arxiv.org/abs/2305.15602
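A minimal sketch of the online-stage Safety Supervisor described above, assuming the explicit CIS is given as a set of halfspaces and that a one-step prediction model and a pool of backup actions are available (all of which are illustrative assumptions):

    import numpy as np

    def in_cis(state, cis_halfspaces):
        """Explicit CIS given as halfspaces A x <= b (an assumed representation)."""
        A, b = cis_halfspaces
        return bool(np.all(A @ state <= b))

    def safety_supervisor(state, action, predict_next, cis, backup_actions):
        """Online stage: if the predicted next state leaves the CIS, replace the
        RL action with a backup action that keeps the system inside the CIS."""
        if in_cis(predict_next(state, action), cis):
            return action, False                  # RL action accepted
        for a in backup_actions:
            if in_cis(predict_next(state, a), cis):
                return a, True                    # corrected; flag retraining
        raise RuntimeError("no action keeps the state inside the CIS")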
In this work we consider a generalization of the well-known multivehicle routing problem: given a network, a set of agents occupying a subset of its nodes, and a set of tasks, we seek a minimum cost sequence of movements subject to the constraint that each task is visited by some agent at least once. The classical version of this problem assumes a central computational server that observes the entire state of the system perfectly and directs individual agents according to a centralized control scheme. In contrast, we assume that there is no centralized server and that each agent is an individual processor with no a priori knowledge of the underlying network (including task and agent locations). Moreover, our agents possess strictly local communication and sensing capabilities (restricted to a fixed radius around their respective locations), aligning more closely with several real-world multiagent applications. These restrictions introduce many challenges that are overcome through local information sharing and direct coordination between agents. We present a fully distributed, online, and scalable reinforcement learning algorithm for this problem whereby agents self-organize into local clusters and independently apply a multiagent rollout scheme locally to each cluster. We demonstrate empirically via extensive simulations that there exists a critical sensing radius beyond which the distributed rollout algorithm begins to improve over a greedy base policy. This critical sensing radius grows proportionally to the $\log^*$ function of the size of the network, and is, therefore, a small constant for any relevant network. Our decentralized reinforcement learning algorithm achieves approximately a factor of two cost improvement over the base policy for a range of radii bounded from below and above by two and three times the critical sensing radius, respectively.
https://arxiv.org/abs/2305.15596
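As a simplified, centralized caricature of the rollout scheme above: a greedy base policy sends each agent to its nearest remaining task, and a rollout step lets one agent evaluate each candidate next task by simulating the base policy to completion and picking the cheapest option. The paper's algorithm runs this idea in a fully distributed way within self-organized local clusters under a limited sensing radius; none of that machinery is shown here.

    def greedy_base_policy(dist, agents, tasks):
        """Each agent repeatedly serves its nearest remaining task; returns
        total travel cost.  `agents` maps agent id to current node, `tasks`
        is a set of task nodes, `dist` is a pairwise distance matrix."""
        agents, tasks, cost = dict(agents), set(tasks), 0.0
        while tasks:
            for i, pos in agents.items():
                if not tasks:
                    break
                t = min(tasks, key=lambda t: dist[pos][t])
                cost += dist[pos][t]
                agents[i] = t
                tasks.remove(t)
        return cost

    def rollout_step(dist, agents, tasks, agent_id):
        """One-agent-at-a-time rollout: try each remaining task as this
        agent's next target, complete the episode with the greedy base
        policy, and pick the choice with the lowest simulated total cost."""
        best_t, best_cost = None, float("inf")
        for t in tasks:
            rest = set(tasks) - {t}
            future = dict(agents)
            future[agent_id] = t
            c = dist[agents[agent_id]][t] + greedy_base_policy(dist, future, rest)
            if c < best_cost:
                best_t, best_cost = t, c
        return best_t, best_cost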
A growing body of evidence suggests that neural networks employed in deep reinforcement learning (RL) gradually lose their plasticity, the ability to learn from new data; however, the analysis and mitigation of this phenomenon is hampered by the complex relationship between plasticity, exploration, and performance in RL. This paper introduces plasticity injection, a minimalistic intervention that increases the network plasticity without changing the number of trainable parameters or biasing the predictions. The applications of this intervention are two-fold: first, as a diagnostic tool; if injection increases the performance, we may conclude that an agent's network was losing its plasticity. This tool allows us to identify a subset of Atari environments where the lack of plasticity causes performance plateaus, motivating future studies on understanding and combating plasticity loss. Second, plasticity injection can be used to improve the computational efficiency of RL training if the agent has to re-learn from scratch due to exhausted plasticity or by growing the agent's network dynamically without compromising performance. The results on Atari show that plasticity injection attains stronger performance compared to alternative methods while being computationally efficient.
https://arxiv.org/abs/2305.15555
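The abstract does not spell out the intervention. One commonly described form of plasticity injection freezes the current head, adds a freshly initialized trainable copy together with a frozen duplicate of that copy, and predicts old(x) + new(x) - new_frozen(x), so the output is initially unchanged and the trainable parameter count stays the same. The PyTorch sketch below follows that reading and should be taken as an approximation, not the paper's exact recipe.

    import copy
    import torch
    import torch.nn as nn

    class PlasticityInjection(nn.Module):
        """Freeze the current head, add a fresh trainable copy and a frozen
        duplicate of that copy, and predict old(x) + new(x) - new_frozen(x).
        At injection time the output is unchanged and the number of trainable
        parameters stays the same (old head frozen, new head trainable)."""
        def __init__(self, head: nn.Module):
            super().__init__()
            self.old = head
            for p in self.old.parameters():
                p.requires_grad_(False)
            self.new = copy.deepcopy(head)
            for p in self.new.parameters():            # re-initialize the new head
                nn.init.normal_(p, std=0.01)
            self.new_frozen = copy.deepcopy(self.new)
            for p in self.new_frozen.parameters():
                p.requires_grad_(False)

        def forward(self, x):
            return self.old(x) + self.new(x) - self.new_frozen(x)

    head = nn.Linear(64, 4)                 # e.g., the Q-value head of an agent
    injected = PlasticityInjection(head)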
Translating natural language sentences to first-order logic (NL-FOL translation) is a longstanding challenge in the NLP and formal logic literature. This paper introduces LogicLLaMA, a LLaMA-7B model fine-tuned for NL-FOL translation using LoRA on a single GPU. LogicLLaMA is capable of directly translating natural language into FOL rules, outperforming GPT-3.5. LogicLLaMA is also equipped to correct FOL rules predicted by GPT-3.5, and can achieve similar performance to GPT-4 at a fraction of the cost. This correction ability is achieved by a novel supervised fine-tuning (SFT) + reinforcement learning with human feedback (RLHF) framework, which initially trains on synthetically perturbed NL-FOL pairs to encourage chain-of-thought reasoning and then fine-tunes with RLHF on GPT-3.5 outputs using a FOL verifier as the reward model. To train LogicLLaMA, we present MALLS (large language Model generAted NL-FOL pairS), a dataset of 34K high-quality and diverse sentence-level NL-FOL pairs collected from GPT-4. The dataset was created with a pipeline that prompts GPT-4 for pairs, dynamically adjusts the prompts to ensure the collection of pairs with rich and diverse contexts at different levels of complexity, and verifies the validity of the generated FOL rules. Codes, weights, and data are available at this https URL.
https://arxiv.org/abs/2305.15541
Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM). Prompted with the LaTeX source as game context and a description of the agent's current observation, our SPRING framework employs a directed acyclic graph (DAG) with game-related questions as nodes and dependencies as edges. We identify the optimal action to take in the environment by traversing the DAG and calculating LLM responses for each node in topological order, with the LLM's answer to the final node directly translating to environment actions. In our experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment. Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training. Finally, we show the potential of games as a test bed for LLMs.
https://arxiv.org/abs/2305.15486
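A compact sketch of the DAG traversal described above: questions are answered in topological order, each prompt containing the game context and the answers of the node's parents, and the final node's answer is returned for mapping to an environment action. The llm callable, the prompt format, and the single-sink assumption are illustrative.

    from graphlib import TopologicalSorter

    def spring_step(questions, deps, context, llm):
        """Traverse the question DAG in topological order, answering each node
        with the LLM conditioned on the game context and its parents' answers;
        the final node's answer is mapped to an environment action."""
        answers = {}
        for node in TopologicalSorter(deps).static_order():
            parent_answers = "\n".join(answers[p] for p in deps.get(node, ()))
            prompt = f"{context}\n{parent_answers}\nQuestion: {questions[node]}"
            answers[node] = llm(prompt)            # llm() is an assumed callable
        return answers[node]                       # answer of the final (sink) node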
Actor-critic (AC) methods are widely used in reinforcement learning (RL) and benefit from the flexibility of using any policy gradient method as the actor and value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of the policy and critic parameterization. Instantiating the generic algorithm results in an actor that involves maximizing a sequence of surrogate functions (similar to TRPO, PPO) and a critic that involves minimizing a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
https://arxiv.org/abs/2305.15249
Reinforcement Learning (RL) is a powerful machine learning paradigm that has been applied in various fields such as robotics, natural language processing, and game playing, achieving state-of-the-art results. Targeted at solving sequential decision-making problems, it is by design able to learn from experience and therefore adapt to changing, dynamic environments. These capabilities make it a prime candidate for controlling and optimizing complex processes in industry. The key to fully exploiting this potential is the seamless integration of RL into existing industrial systems. The industrial communication standard Open Platform Communications Unified Architecture (OPC UA) could bridge this gap. However, since RL and OPC UA are from different fields, there is a need for researchers to bridge the gap between the two technologies. This work serves to bridge this gap by providing a brief technical overview of both technologies and carrying out a semi-exhaustive literature review to gain insights on how RL and OPC UA are applied in combination. Based on this survey, three main research topics at the intersection of RL and OPC UA have been identified. The results of the literature review show that RL is a promising technology for the control and optimization of industrial processes, but does not yet have the necessary standardized interfaces to be deployed in real-world scenarios with reasonably low effort.
https://arxiv.org/abs/2305.15113
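The survey itself proposes no code, but the integration gap it discusses can be made concrete with a hypothetical control loop that reads process variables from an OPC UA server and writes the agent's setpoints back. The sketch assumes the python-opcua client package; the endpoint, node identifiers, and agent interface are made up for illustration.

    from opcua import Client          # python-opcua package (assumed available)

    ENDPOINT = "opc.tcp://localhost:4840"        # hypothetical PLC endpoint
    STATE_NODE = "ns=2;s=Process.Temperature"    # hypothetical node identifiers
    ACTION_NODE = "ns=2;s=Process.ValveSetpoint"

    def control_loop(agent, steps=100):
        """Read the process state from an OPC UA server, let the RL agent
        choose an action, and write the setpoint back (one possible
        integration pattern)."""
        client = Client(ENDPOINT)
        client.connect()
        try:
            for _ in range(steps):
                state = client.get_node(STATE_NODE).get_value()
                action = agent.act(state)                   # assumed agent API
                client.get_node(ACTION_NODE).set_value(float(action))
        finally:
            client.disconnect()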
Large language models excel at a variety of language tasks when prompted with examples or instructions. Yet controlling these models through prompting alone is limited. Tailoring language models through fine-tuning (e.g., via reinforcement learning) can be effective, but it is expensive and requires model access. We propose Inference-time Policy Adapters (IPA), which efficiently tailors a language model such as GPT-3 without fine-tuning it. IPA guides a large base model during decoding time through a lightweight policy adaptor trained to optimize an arbitrary user objective with reinforcement learning. On five challenging text generation tasks, such as toxicity reduction and open-domain generation, IPA consistently brings significant improvements over off-the-shelf language models. It outperforms competitive baseline methods, sometimes even including expensive fine-tuning. In particular, tailoring GPT-2 with IPA can outperform GPT-3, while tailoring GPT-3 with IPA brings a major performance boost over GPT-3 (and sometimes even over GPT-4). Our promising results highlight the potential of IPA as a lightweight alternative to tailoring extreme-scale language models.
https://arxiv.org/abs/2305.15065
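The abstract does not state how the adapter's policy is combined with the frozen base model at decoding time. A common product-of-experts-style assumption is to add the adapter's token logits to the base model's logits before sampling, as sketched below; IPA's actual combination rule may differ.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def adapted_decode_step(base_logits, adapter_logits, alpha=1.0):
        """Combine the frozen base model's next-token distribution with a
        small policy adapter's distribution (product-of-experts style: add
        logits), then sample the next token.  The combination rule here is an
        assumption, not necessarily IPA's."""
        combined = base_logits + alpha * adapter_logits
        probs = F.softmax(combined, dim=-1)
        return torch.multinomial(probs, num_samples=1)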
A trustworthy real-world prediction system should be well-calibrated; that is, its confidence in an answer is indicative of the likelihood that the answer is correct, enabling deferral to a more expensive expert in cases of low-confidence predictions. While recent studies have shown that unsupervised pre-training produces large language models (LMs) that are remarkably well-calibrated, the most widely-used LMs in practice are fine-tuned with reinforcement learning with human feedback (RLHF-LMs) after the initial unsupervised pre-training stage, and results are mixed as to whether these models preserve the well-calibratedness of their ancestors. In this paper, we conduct a broad evaluation of computationally feasible methods for extracting confidence scores from LLMs fine-tuned with RLHF. We find that with the right prompting strategy, RLHF-LMs verbalize probabilities that are much better calibrated than the model's conditional probabilities, enabling fairly well-calibrated predictions. Through a combination of prompting strategy and temperature scaling, we find that we can reduce the expected calibration error of RLHF-LMs by over 50%.
https://arxiv.org/abs/2305.14975
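The evaluation metric and the post-hoc scaling step mentioned above are standard and can be sketched directly; the prompting strategies for eliciting verbalized probabilities are the paper's contribution and are not reproduced here.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """Standard ECE: average |accuracy minus confidence| over confidence
        bins, weighted by bin size."""
        confidences, correct = np.asarray(confidences), np.asarray(correct)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(correct[mask].mean() - confidences[mask].mean())
                ece += mask.mean() * gap
        return ece

    def temperature_scale(logits, T):
        """Temperature scaling of raw logits before the softmax; T > 1
        softens, T < 1 sharpens.  T is typically fit on held-out data."""
        z = logits / T
        z = z - z.max(axis=-1, keepdims=True)
        p = np.exp(z)
        return p / p.sum(axis=-1, keepdims=True)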
Improving language model generations according to some user-defined quality or style constraints is challenging. Typical approaches include learning on additional human-written data, filtering "low-quality" data using heuristics, and/or using reinforcement learning with human feedback (RLHF). However, filtering can remove valuable training signals, whereas data collection and RLHF constantly require additional human-written or LM exploration data, which can be costly to obtain. A natural question to ask is "Can we leverage RL to optimize LM utility on existing crowd-sourced and internet data?" To this end, we present Left-over Lunch RL (LoL-RL), a simple training algorithm that uses offline policy gradients for learning language generation tasks as a 1-step RL game. LoL-RL can finetune LMs to optimize arbitrary classifier-based or human-defined utility functions on any sequence-to-sequence data. Experiments with five different language generation tasks using models of varying sizes and multiple rewards show that models trained with LoL-RL can consistently outperform the best supervised learning models. We also release our experimental code at this https URL.
https://arxiv.org/abs/2305.14718
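The abstract frames generation as a 1-step RL game trained with offline policy gradients on existing data. A generic form of such a loss, which is an assumption rather than LoL-RL's exact estimator, weights the log-likelihood of the logged output by its utility:

    import torch

    def offline_policy_gradient_loss(token_logprobs, mask, rewards):
        """Reward-weighted sequence log-likelihood on existing (x, y) pairs,
        treating generation as a one-step RL game: maximize r(x, y) * log p(y|x).
        `token_logprobs` are the model's log-probs of the reference tokens,
        `mask` zeroes out padding, `rewards` come from a classifier or human-
        defined utility."""
        seq_logprob = (token_logprobs * mask).sum(dim=-1)      # (batch,)
        return -(rewards * seq_logprob).mean()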
Pretrained model-based evaluation metrics have demonstrated strong performance with high correlations with human judgments in various natural language generation tasks such as image captioning. Despite the impressive results, their impact on fairness is under-explored: it is widely acknowledged that pretrained models can encode societal biases, and utilizing them for evaluation purposes may inadvertently manifest and potentially amplify biases. In this paper, we conduct a systematic study of gender biases in model-based evaluation metrics with a focus on image captioning tasks. Specifically, we first identify and quantify gender biases in different evaluation metrics regarding profession, activity, and object concepts. Then, we demonstrate the negative consequences of using these biased metrics, such as favoring biased generation models in deployment and propagating the biases to generation models through reinforcement learning. We also present a simple but effective alternative to reduce gender biases by combining n-gram matching-based and pretrained model-based evaluation metrics.
https://arxiv.org/abs/2305.14711
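The combination mentioned in the last sentence above could be as simple as a convex interpolation between the two metric families; the paper's exact scheme and weighting are not given in the abstract, so the snippet below is only a placeholder for the idea.

    def combined_score(ngram_score, model_score, alpha=0.5):
        """One simple way to combine an n-gram metric (e.g., CIDEr or BLEU)
        with a pretrained-model metric: a convex interpolation.  The value of
        alpha would be tuned to trade off correlation with humans against
        measured gender bias."""
        return alpha * ngram_score + (1.0 - alpha) * model_score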
Animals have evolved various agile locomotion strategies, such as sprinting, leaping, and jumping. There is a growing interest in developing legged robots that move like their biological counterparts and show various agile skills to navigate complex environments quickly. Despite the interest, the field lacks systematic benchmarks to measure the performance of control policies and hardware in agility. We introduce the Barkour benchmark, an obstacle course to quantify agility for legged robots. Inspired by dog agility competitions, it consists of diverse obstacles and a time-based scoring mechanism. This encourages researchers to develop controllers that not only move fast, but do so in a controllable and versatile way. To set strong baselines, we present two methods for tackling the benchmark. In the first approach, we train specialist locomotion skills using on-policy reinforcement learning methods and combine them with a high-level navigation controller. In the second approach, we distill the specialist skills into a Transformer-based generalist locomotion policy, named Locomotion-Transformer, that can handle various terrains and adjust the robot's gait based on the perceived environment and robot states. Using a custom-built quadruped robot, we demonstrate that our method can complete the course at half the speed of a dog. We hope that our work represents a step towards creating controllers that enable robots to reach animal-level agility.
https://arxiv.org/abs/2305.14654
Autonomous driving has received a great deal of attention in the automotive industry and is often seen as the future of transportation. The development of autonomous driving technology has been greatly accelerated by the growth of end-to-end machine learning techniques that have been successfully used for perception, planning, and control tasks. An important aspect of autonomous driving planning is knowing how the environment evolves in the immediate future and taking appropriate actions. An autonomous driving system should effectively use the information collected from the various sensors to form an abstract representation of the world to maintain situational awareness. For this purpose, deep learning models can be used to learn compact latent representations from a stream of incoming data. However, most deep learning models are trained end-to-end and do not incorporate any prior knowledge (e.g., from physics) of the vehicle in the architecture. In this direction, many works have explored physics-infused neural network (PINN) architectures to infuse physics models during training. Inspired by this observation, we present a Kalman filter augmented recurrent neural network architecture to learn the latent representation of the traffic flow using front camera images only. We demonstrate the efficacy of the proposed model in both imitation and reinforcement learning settings using both simulated and real-world datasets. The results show that incorporating an explicit model of the vehicle (states estimated using Kalman filtering) in the end-to-end learning significantly increases performance.
https://arxiv.org/abs/2305.14644
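The explicit vehicle model referred to above is a Kalman filter whose state estimate is fed to the recurrent network alongside the camera features. The filter equations themselves are standard; how they are wired into the RNN is the paper's contribution and is not shown here.

    import numpy as np

    def kalman_step(x, P, u, z, A, B, H, Q, R):
        """One predict/update cycle of a linear Kalman filter, i.e. the kind
        of explicit vehicle model whose state estimate can be fed to the
        recurrent network alongside the camera features."""
        # Predict
        x_pred = A @ x + B @ u
        P_pred = A @ P @ A.T + Q
        # Update with measurement z
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x_new = x_pred + K @ (z - H @ x_pred)
        P_new = (np.eye(len(x)) - K @ H) @ P_pred
        return x_new, P_new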