Given the increasing demand for mental health assistance, artificial intelligence (AI), particularly large language models (LLMs), may be valuable for integration into automated clinical support systems. In this work, we leverage a decision transformer architecture for topic recommendation in counseling conversations between patients and mental health professionals. The architecture is utilized for offline reinforcement learning, and we extract states (dialogue turn embeddings), actions (conversation topics), and rewards (scores measuring the alignment between patient and therapist) from previous turns within a conversation to train a decision transformer model. We demonstrate an improvement over baseline reinforcement learning methods, and propose a novel system of utilizing our model's output as synthetic labels for fine-tuning a large language model for the same task. Although our implementation based on LLaMA-2 7B has mixed results, future work can undoubtedly build on the design.
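The core data transformation here is the standard decision-transformer packing: each conversation becomes a sequence of (return-to-go, state, action) triples. Below is a minimal sketch of that step, assuming per-turn embeddings, topic ids, and alignment rewards have already been extracted; the names are illustrative, not the authors' code.

```python
# Hypothetical sketch of packing a counseling dialogue into decision-transformer
# inputs. Assumes per-turn embeddings, topic ids, and alignment rewards exist.
import numpy as np

def build_dt_inputs(turn_embeddings, topic_ids, rewards):
    """turn_embeddings: (T, d) array; topic_ids: length-T ints; rewards: length-T floats."""
    rewards = np.asarray(rewards, dtype=np.float32)
    # Return-to-go at turn t is the sum of rewards from t to the end of the conversation.
    returns_to_go = np.cumsum(rewards[::-1])[::-1]
    states = np.asarray(turn_embeddings, dtype=np.float32)
    actions = np.asarray(topic_ids, dtype=np.int64)
    # The model is trained to predict actions[t] from (returns_to_go[<=t], states[<=t], actions[<t]).
    return returns_to_go, states, actions
```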
https://arxiv.org/abs/2405.05060
Reinforcement learning provides an appealing framework for robotic control due to its ability to learn expressive policies purely through real-world interaction. However, this requires addressing real-world constraints and avoiding catastrophic failures during training, which might severely impede both learning progress and the performance of the final policy. In many robotics settings, this amounts to avoiding certain "unsafe" states. The high-speed off-road driving task represents a particularly challenging instantiation of this problem: a high-return policy should drive as aggressively and as quickly as possible, which often requires getting close to the edge of the set of "safe" states, and therefore places a particular burden on the method to avoid frequent failures. To both learn highly performant policies and avoid excessive failures, we propose a reinforcement learning framework that combines risk-sensitive control with an adaptive action space curriculum. Furthermore, we show that our risk-sensitive objective automatically avoids out-of-distribution states when equipped with an estimator for epistemic uncertainty. We implement our algorithm on a small-scale rally car and show that it is capable of learning high-speed policies for a real-world off-road driving task. We show that our method greatly reduces the number of safety violations during the training process, and actually leads to higher-performance policies in both driving and non-driving simulation environments with similar challenges.
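The abstract does not spell out the risk-sensitive objective; a common choice in this setting is a CVaR-style criterion over sampled returns, which optimizes the worst-case tail rather than the average. The sketch below is a hedged illustration of that idea, not the paper's exact formulation.

```python
# Hedged sketch of one common risk-sensitive objective (CVaR over sampled returns).
import numpy as np

def cvar_objective(sampled_returns, alpha=0.1):
    """Mean of the worst alpha-fraction of returns: optimizing this pushes the
    policy away from rare catastrophic outcomes rather than the average case."""
    returns = np.sort(np.asarray(sampled_returns, dtype=np.float64))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()
```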
https://arxiv.org/abs/2405.04714
Proximal Policy Optimization with Adaptive Exploration (axPPO) is introduced as a novel learning algorithm. This paper investigates the exploration-exploitation tradeoff within the context of reinforcement learning and aims to contribute new insights into reinforcement learning algorithm design. The proposed adaptive exploration framework dynamically adjusts the exploration magnitude during training based on the recent performance of the agent. Our proposed method outperforms standard PPO algorithms in learning efficiency, particularly when significant exploratory behavior is needed at the beginning of the learning process.
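A minimal sketch of the adaptive-exploration idea: scale PPO's exploration magnitude (here, the entropy coefficient) by the agent's recent performance, exploring more when returns are low and less as they improve. The specific scaling rule is an assumption for illustration, not the axPPO formula.

```python
# Illustrative sketch: adapt PPO's entropy bonus from recent episode returns.
from collections import deque

class AdaptiveEntropyCoef:
    def __init__(self, base_coef=0.01, window=100):
        self.base_coef = base_coef
        self.recent = deque(maxlen=window)

    def update(self, episode_return):
        self.recent.append(episode_return)

    def coef(self, return_low, return_high):
        """More exploration when recent returns sit near the low end of the
        expected range, tapering off as performance improves."""
        if not self.recent:
            return self.base_coef
        avg = sum(self.recent) / len(self.recent)
        frac = (avg - return_low) / max(return_high - return_low, 1e-8)
        frac = min(max(frac, 0.0), 1.0)
        return self.base_coef * (2.0 - frac)  # between 1x and 2x the base coefficient
```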
https://arxiv.org/abs/2405.04664
In recent years, reinforcement learning (RL) has emerged as a valuable tool in drug design, offering the potential to propose and optimize molecules with desired properties. However, striking a balance between capability, flexibility, and reliability remains challenging due to the complexity of advanced RL algorithms and the significant reliance on specialized code. In this work, we introduce ACEGEN, a comprehensive and streamlined toolkit tailored for generative drug design, built using TorchRL, a modern decision-making library that offers efficient and thoroughly tested reusable components. ACEGEN provides a robust, flexible, and efficient platform for molecular design. We validate its effectiveness by benchmarking it across various algorithms and conducting multiple drug discovery case studies. ACEGEN is accessible at this https URL.
https://arxiv.org/abs/2405.04657
The training, testing, and deployment of autonomous vehicles requires realistic and efficient simulators. Moreover, because of the high variability between the problems presented by different autonomous systems, these simulators need to be easy to use and easy to modify. To address these problems we introduce TorchDriveSim and its benchmark extension TorchDriveEnv. TorchDriveEnv is a lightweight reinforcement learning benchmark programmed entirely in Python, which can be modified to test a number of different factors in learned vehicle behavior, including the effect of varying kinematic models, agent types, and traffic control patterns. Most importantly, unlike many replay-based simulation approaches, TorchDriveEnv is fully integrated with a state-of-the-art behavioral simulation API. This allows users to train and evaluate driving models alongside data-driven Non-Playable Characters (NPCs) whose initializations and driving behavior are reactive, realistic, and diverse. We illustrate the efficiency and simplicity of TorchDriveEnv by evaluating common reinforcement learning baselines in both training and validation environments. Our experiments show that TorchDriveEnv is easy to use, but difficult to solve.
https://arxiv.org/abs/2405.04491
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at this https URL.
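A minimal sketch of the latent-KV idea behind MLA: cache one small latent vector per token and up-project it to keys and values at attention time, shrinking the KV cache by roughly d_model/d_latent. Dimensions are illustrative, and details such as rotary embeddings and per-head handling are omitted.

```python
# Minimal sketch of latent KV compression (the core of Multi-head Latent Attention).
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    def __init__(self, d_model=1024, d_latent=128, d_head=64, n_heads=16):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)        # compress per token
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def forward(self, h):                    # h: (batch, seq, d_model)
        c = self.down(h)                     # (batch, seq, d_latent) -- this is all we cache
        k = self.up_k(c)                     # keys reconstructed at attention time
        v = self.up_v(c)                     # values reconstructed at attention time
        return c, k, v
```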
https://arxiv.org/abs/2405.04434
Recent developments in large language models (LLMs), while offering a powerful foundation for developing natural language agents, raise safety concerns about them and the autonomous agents built upon them. Deception is one potential capability of AI agents of particular concern, which we refer to as an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception as straight-out lying, making objectively selfish decisions, or giving false information, as seen in previous AI safety research. We target a specific category of deception achieved through obfuscation and equivocation. We broadly explain the two types of deception by analogy with the rabbit-out-of-hat magic trick, where (i) the rabbit either comes out of a hidden trap door or (ii) (our focus) the audience is so thoroughly distracted that the magician can bring out the rabbit right in front of them using sleight of hand or misdirection. Our novel testbed framework displays the intrinsic deception capabilities of LLM agents in a goal-driven environment when directed to be deceptive in their natural language generations, using a two-agent adversarial dialogue system built upon the legislative task of "lobbying" for a bill. Within this goal-driven environment, we show that deceptive capacity can be developed through a reinforcement learning setup, built around theories from the philosophy of language and cognitive psychology. We find that the lobbyist agent increases its deceptive capabilities by ~40% (relative) through subsequent reinforcement trials of adversarial interactions, and our deception detection mechanism shows a detection capability of up to 92%. Our results highlight potential issues in agent-human interaction, with agents potentially manipulating humans toward their programmed end-goals.
https://arxiv.org/abs/2405.04325
Offline reinforcement learning (RL) provides a promising approach to avoid costly online interaction with the real environment. However, the performance of offline RL highly depends on the quality of the datasets, which may cause extrapolation error in the learning process. In many robotic applications, an inaccurate simulator is often available. However, data collected from the inaccurate simulator cannot be used directly in offline RL due to the well-known exploration-exploitation dilemma and the dynamics gap between the inaccurate simulation and the real environment. To address these issues, we propose a novel approach to better combine the offline dataset with the inaccurate simulation data. Specifically, we pre-train a generative adversarial network (GAN) model to fit the state distribution of the offline dataset. Given this, we collect data from the inaccurate simulator starting from the distribution provided by the generator, and reweight the simulated data using the discriminator. Our experimental results on the D4RL benchmark and a real-world manipulation task confirm that our method benefits more from both the inaccurate simulator and the limited offline dataset, achieving better performance than state-of-the-art methods.
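One standard way to realize the discriminator-based reweighting is the density-ratio trick, where w(s) = D(s)/(1 − D(s)) up-weights simulated states that resemble the offline data. A sketch under that assumption (the paper's exact weighting may differ):

```python
# Hedged sketch: turn discriminator scores into importance weights for simulated data.
import numpy as np

def importance_weights(d_scores, eps=1e-6):
    """d_scores: discriminator outputs D(s) in (0,1), the probability a state
    looks like the offline (real) data. w = D / (1 - D) up-weights simulated
    states close to the offline state distribution."""
    d = np.clip(np.asarray(d_scores, dtype=np.float64), eps, 1.0 - eps)
    w = d / (1.0 - d)
    return w / w.mean()  # normalize so weights average to 1
```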
https://arxiv.org/abs/2405.04307
One of the key challenges for current Reinforcement Learning (RL)-based Automated Driving (AD) agents is achieving flexible, precise, and human-like behavior cost-effectively. This paper introduces an innovative approach utilizing Large Language Models (LLMs) to intuitively and effectively optimize RL reward functions in a human-centric way. We developed a framework where instructions and dynamic environment descriptions are input into the LLM. The LLM then utilizes this information to assist in generating rewards, thereby steering the behavior of RL agents towards patterns that more closely resemble human driving. The experimental results demonstrate that this approach not only makes RL agents more anthropomorphic but also achieves better performance. Additionally, various strategies for reward-proxy and reward-shaping are investigated, revealing the significant impact of prompt design on shaping an AD vehicle's behavior. These findings offer a promising direction for the development of more advanced and human-like automated driving systems. Our experimental data and source code can be found here.
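A hedged sketch of the reward-generation loop the framework describes; `query_llm` and the prompt/response format here are hypothetical placeholders, not the paper's actual interface.

```python
# Illustrative only: query an LLM for a scalar driving reward.
def llm_reward(query_llm, instruction, env_description, transition_summary):
    prompt = (
        f"Instruction: {instruction}\n"
        f"Environment: {env_description}\n"
        f"Transition: {transition_summary}\n"
        "Rate how human-like and safe this driving behavior is "
        "on a scale from -1.0 to 1.0. Reply with a single number."
    )
    reply = query_llm(prompt)          # hypothetical LLM call
    try:
        return max(-1.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0                     # fall back to a neutral reward on malformed output
```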
https://arxiv.org/abs/2405.04135
Recent advances in robot skill learning have unlocked the potential to construct task-agnostic skill libraries, facilitating the seamless sequencing of multiple simple manipulation primitives (a.k.a. skills) to tackle significantly more complex tasks. Nevertheless, determining the optimal sequence for independently learned skills remains an open problem, particularly when the objective is given solely in terms of the final geometric configuration rather than a symbolic goal. To address this challenge, we propose Logic-Skill Programming (LSP), an optimization-based approach that sequences independently learned skills to solve long-horizon tasks. We formulate a first-order extension of a mathematical program to optimize the overall cumulative reward of all skills within a plan, abstracted by the sum of value functions. To solve such programs, we leverage Tensor Train to construct the value function space, and rely on alternations between symbolic search and skill value optimization to find the appropriate skill skeleton and optimal subgoal sequence. Experimental results indicate that the obtained value functions provide a superior approximation of cumulative rewards compared to state-of-the-art Reinforcement Learning methods. Furthermore, we validate LSP in three manipulation domains, encompassing both prehensile and non-prehensile primitives. The results demonstrate its capability to identify the optimal solution over the full logic and geometric path. The real-robot experiments showcase the effectiveness of our approach in coping with contact uncertainty and external disturbances in the real world.
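One plausible reading of the optimization at the heart of LSP, in our own notation (the paper's exact program may differ): choose a skill skeleton a_{1:K} and subgoal sequence g_{1:K} that maximize the sum of the learned per-skill value functions, chained through the subgoals:

```latex
\max_{a_{1:K},\, g_{1:K}} \;\sum_{k=1}^{K} V_{a_k}(g_{k-1},\, g_k)
\qquad \text{s.t.} \quad g_0 = s_{\mathrm{init}}, \;\; g_K \in \mathcal{G}_{\mathrm{goal}}
```

The alternation the abstract describes then corresponds to symbolic search over a_{1:K} interleaved with continuous optimization of g_{1:K}.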
https://arxiv.org/abs/2405.04082
Despite notable successes of Reinforcement Learning (RL), the prevalent use of an online learning paradigm prevents its widespread adoption, especially in hazardous or costly scenarios. Offline RL has emerged as an alternative solution, learning from pre-collected static datasets. However, this offline learning introduces a new challenge known as distributional shift, degrading performance when the policy is evaluated on scenarios that are Out-Of-Distribution (OOD) with respect to the training dataset. Most existing offline RL methods resolve this issue by regularizing policy learning within the information supported by the given dataset. However, such regularization overlooks the potential for high-reward regions that may exist beyond the dataset. This motivates exploring novel offline learning techniques that can make improvements beyond the data support without compromising policy performance, potentially by learning causation (cause-and-effect) instead of correlation from the dataset. In this paper, we propose the MOOD-CRL (Model-based Offline OOD-Adapting Causal RL) algorithm, which aims to address the challenge of extrapolation for offline policy training through causal inference instead of policy-regularizing methods. Specifically, a Causal Normalizing Flow (CNF) is developed to learn the transition and reward functions for data generation and augmentation in offline policy evaluation and training. Based on the data-invariant, physics-based qualitative causal graph and the observational data, we develop a novel learning scheme for the CNF to learn the quantitative structural causal model. As a result, the CNF gains predictive and counterfactual reasoning capabilities for sequential decision-making tasks, revealing a high potential for OOD adaptation. Our CNF-based offline RL approach is validated through empirical evaluations, outperforming model-free and model-based methods by a significant margin.
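At a high level, the learned flow serves as a generative dynamics and reward model for data augmentation. The sketch below treats it as a black box with an assumed `sample(s, a) -> (s_next, r)` interface; the CNF internals and the causal-graph conditioning are not modeled here.

```python
# Hedged sketch of model-based data augmentation with a learned generative model.
def augment_dataset(cnf_model, offline_states, offline_actions, n_samples):
    """cnf_model: assumed interface sample(s, a) -> (s_next, r); the states and
    actions are drawn from the offline dataset to anchor the synthetic rollouts."""
    synthetic = []
    for s, a in zip(offline_states[:n_samples], offline_actions[:n_samples]):
        s_next, r = cnf_model.sample(s, a)   # generate a synthetic transition
        synthetic.append((s, a, r, s_next))
    return synthetic
```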
https://arxiv.org/abs/2405.03892
Adversarial machine learning, focused on studying various attacks and defenses on machine learning (ML) models, is rapidly gaining importance as ML is increasingly being adopted for optimizing wireless systems such as Open Radio Access Networks (O-RAN). Comprehensive modeling of the security threats, and demonstration of adversarial attacks and defenses on practical AI-based O-RAN systems, is still in its nascent stages. We begin by conducting threat modeling to pinpoint attack surfaces in O-RAN, using an ML-based Connection management application (xApp) as an example. The xApp uses a Graph Neural Network trained using Deep Reinforcement Learning and achieves on average a 54% improvement in the coverage rate, measured as the 5th-percentile user data rates. We then formulate and demonstrate evasion attacks that degrade the coverage rates by as much as 50% by injecting bounded noise at different threat surfaces, including the open wireless medium itself. Crucially, we also compare and contrast the effectiveness of such attacks on the ML-based xApp and a non-ML based heuristic. Finally, we develop and demonstrate robust training-based defenses against the challenging physical/jamming-based attacks, and show a 15% improvement in the coverage rates when compared to employing no defense over a range of noise budgets.
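For intuition, here is a generic bounded evasion attack in the FGSM style: perturb the model input within an L-infinity ball so as to degrade its output. The paper's attacks target specific O-RAN threat surfaces (including the wireless medium itself), which this simplified sketch does not model.

```python
# Generic sketch of a bounded (L-infinity) evasion attack on a differentiable model.
import torch

def bounded_noise_attack(model, x, loss_fn, target, epsilon=0.1):
    """Perturb input x within an L-infinity ball of radius epsilon so as to
    increase the model's loss against `target` (untargeted FGSM step)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), target)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.detach()
```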
https://arxiv.org/abs/2405.03891
Undeniably, Large Language Models (LLMs) have stirred an extraordinary wave of innovation in the machine learning research domain, resulting in substantial impact across diverse fields such as reinforcement learning, robotics, and computer vision. Their incorporation has been rapid and transformative, marking a significant paradigm shift in the field of machine learning research. However, the field of experimental design, grounded in black-box optimization, has been much less affected by such a paradigm shift, even though integrating LLMs with optimization presents a unique landscape ripe for exploration. In this position paper, we frame the field of black-box optimization around sequence-based foundation models and organize their relationship with previous literature. We discuss the most promising ways foundational language models can revolutionize optimization, which include harnessing the vast wealth of information encapsulated in free-form text to enrich task comprehension, utilizing highly flexible sequence models such as Transformers to engineer superior optimization strategies, and enhancing performance prediction over previously unseen search spaces.
https://arxiv.org/abs/2405.03547
Reinforcement learning (RL) presents a promising framework for learning policies through environment interaction, but often requires an infeasible amount of interaction data to solve complex tasks from sparse rewards. One direction is to augment RL with offline data demonstrating the desired tasks, but past work often requires a lot of high-quality demonstration data that is difficult to obtain, especially in domains such as robotics. Our approach consists of a reverse curriculum followed by a forward curriculum. Unique to our approach compared to past work is the ability to efficiently leverage more than one demonstration via a per-demonstration reverse curriculum generated through state resets. The result of our reverse curriculum is an initial policy that performs well on a narrow initial state distribution and helps overcome difficult exploration problems. A forward curriculum is then used to accelerate the training of the initial policy to perform well on the full initial state distribution of the task and improve demonstration and sample efficiency. We show how the combination of a reverse curriculum and forward curriculum in our method, RFCL, enables significant improvements in demonstration and sample efficiency compared against various state-of-the-art learning-from-demonstration baselines, even solving previously unsolvable tasks that require high precision and control.
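A hedged sketch of the per-demonstration reverse curriculum via state resets: episodes start from states near the end of a demonstration, and the reset point moves earlier as the agent's success rate rises. Thresholds and step sizes are illustrative assumptions, not RFCL's actual hyperparameters.

```python
# Illustrative per-demonstration reverse curriculum driven by state resets.
import random

class ReverseCurriculum:
    def __init__(self, demo_states, success_threshold=0.8, step_back=5):
        self.demo_states = demo_states        # env states recorded along one demonstration
        self.start_idx = len(demo_states) - 1 # begin right before the goal
        self.success_threshold = success_threshold
        self.step_back = step_back

    def sample_reset_state(self):
        # Small jitter around the current start index keeps the reset distribution smooth.
        idx = max(0, self.start_idx - random.randint(0, 2))
        return self.demo_states[idx]

    def update(self, recent_success_rate):
        # Once the agent reliably succeeds from here, move the start point earlier.
        if recent_success_rate >= self.success_threshold:
            self.start_idx = max(0, self.start_idx - self.step_back)
```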
https://arxiv.org/abs/2405.03379
Q-learning excels at learning from feedback within sequential decision-making tasks but requires extensive sampling to achieve significant improvements. Although reward shaping is a powerful technique for enhancing learning efficiency, it can introduce biases that affect agent performance. Furthermore, potential-based reward shaping is constrained in that it does not allow for reward modifications based on actions or terminal states, potentially limiting its effectiveness in complex environments. Additionally, large language models (LLMs) can achieve zero-shot learning, but this is generally limited to simpler tasks. They also exhibit low inference speeds and occasionally produce hallucinations. To address these issues, we propose LLM-guided Q-learning, which employs LLMs as a heuristic to aid in learning the Q-function for reinforcement learning. It combines the advantages of both technologies without introducing performance bias. Our theoretical analysis demonstrates that the LLM heuristic provides action-level guidance. Additionally, our architecture has the capability to convert the impact of hallucinations into exploration costs. Moreover, the converged Q-function corresponds to the MDP's optimal Q-function. Experimental results demonstrate that our algorithm enables agents to avoid ineffective exploration, enhances sampling efficiency, and is well-suited for complex control tasks.
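One way to picture "hallucinations become exploration costs" is an LLM action prior whose influence decays with visitation, so a wrong hint merely wastes some exploration while the converged Q-function is untouched. This is an illustrative sketch, not the paper's exact algorithm:

```python
# Illustrative action selection with a decaying LLM heuristic bonus.
import numpy as np

def select_action(q_values, llm_prior, visit_count, beta=1.0):
    """q_values: (n_actions,) current Q estimates; llm_prior: (n_actions,)
    heuristic scores from the LLM; the bonus shrinks as the state is visited,
    so a hallucinated prior costs only extra exploration early on."""
    bonus = beta * np.asarray(llm_prior) / (1.0 + visit_count)
    return int(np.argmax(np.asarray(q_values) + bonus))
```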
https://arxiv.org/abs/2405.03341
Purpose: Autonomous navigation of devices in endovascular interventions can decrease operation times, improve decision-making during surgery, and reduce operator radiation exposure while increasing access to treatment. This systematic review explores recent literature to assess the impact, challenges, and opportunities artificial intelligence (AI) has for autonomous endovascular intervention navigation. Methods: The PubMed and IEEE Xplore databases were queried. Eligibility criteria included studies investigating the use of AI in enabling the autonomous navigation of catheters/guidewires in endovascular interventions. Following PRISMA, articles were assessed using QUADAS-2. PROSPERO: CRD42023392259. Results: Among 462 studies, fourteen met the inclusion criteria. Reinforcement learning (9/14, 64%) and learning from demonstration (7/14, 50%) were used as data-driven models for autonomous navigation. Studies predominantly utilised physical phantoms (10/14, 71%) and in silico (4/14, 29%) models. Experiments within or around the blood vessels of the heart were reported by the majority of studies (10/14, 71%), while simple non-anatomical vessel platforms were used in three studies (3/14, 21%), and the porcine liver venous system in one study. We observed that risk of bias and poor generalisability were present across studies. No procedures were performed on patients in any of the studies reviewed. Studies lacked patient selection criteria, reference standards, and reproducibility, resulting in low levels of clinical evidence. Conclusions: AI's potential in autonomous endovascular navigation is promising, but remains at an experimental proof-of-concept stage, with a technology readiness level of 3. We highlight that reference standards with well-identified performance metrics are crucial to allow for comparisons of the data-driven algorithms proposed in the years to come.
https://arxiv.org/abs/2405.03305
In the course of the energy transition, generation and consumption will expand and change, and many of the technologies involved, such as PV systems, electric cars, and heat pumps, will influence the power flow, especially in the distribution grids. Scalable methods that can make decisions for each grid connection are needed to enable congestion-free grid operation in the distribution grids. This paper presents a novel end-to-end approach to resolving congestion in distribution grids with deep reinforcement learning. Our architecture learns to curtail power and set appropriate reactive power to determine a non-congested and, thus, feasible grid state. State-of-the-art methods such as the optimal power flow (OPF) demand high computational costs and detailed measurements of every bus in a grid. In contrast, the presented method enables decisions under sparse information, with just some buses observable in the grid. Distribution grids are generally not yet fully digitized and observable, so this method can be used for decision-making on the majority of low-voltage grids. On a real low-voltage grid, the approach resolves 100% of violations in the voltage band and 98.8% of asset overloads. The results show that decisions can also be made on real grids that guarantee sufficient quality for congestion-free grid operation.
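For concreteness, a toy sketch of the kind of per-connection action such an architecture outputs, an active-power curtailment factor and a reactive-power setpoint; the data layout is purely hypothetical.

```python
# Hypothetical sketch: applying an agent's action at each controllable grid connection.
def apply_action(connections, action):
    """connections: list of dicts with 'p_max' (kW); action: list of
    (curtail_frac, q_setpoint) pairs, one per connection."""
    setpoints = []
    for conn, (curtail_frac, q_set) in zip(connections, action):
        # Curtail active power by the given fraction, clipped to [0, 1].
        p = conn["p_max"] * (1.0 - min(max(curtail_frac, 0.0), 1.0))
        setpoints.append({"p_kw": p, "q_kvar": q_set})
    return setpoints
```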
https://arxiv.org/abs/2405.03262
Reinforcement Learning is a promising tool for learning complex policies even in fast-moving and object-interactive domains where human teleoperation or hard-coded policies might fail. To effectively reflect this challenging category of tasks, we introduce a dynamic, interactive RL testbed based on robot air hockey. By augmenting air hockey with a large family of tasks ranging from easy ones like reaching to challenging ones like pushing a block by hitting it with a puck, as well as goal-based and human-interactive tasks, our testbed allows a varied assessment of RL capabilities. The robot air hockey testbed also supports sim-to-real transfer with three domains: two simulators of increasing fidelity and a real robot system. Using a dataset of demonstration data gathered through two teleoperation systems, a virtualized control environment and human shadowing, we assess the testbed with behavior cloning, offline RL, and RL from scratch.
https://arxiv.org/abs/2405.03113
Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially with sparse rewards, remains a significant challenge. The training of a DRL agent can often be trapped in a bottleneck without further progress. In this paper, we propose RICE, an innovative refining scheme for reinforcement learning that incorporates explanation methods to break through the training bottlenecks. The high-level idea of RICE is to construct a new initial state distribution that combines both the default initial states and critical states identified through explanation methods, thereby encouraging the agent to explore from the mixed initial states. Through careful design, we can theoretically guarantee that our refining scheme has a tighter sub-optimality bound. We evaluate RICE in various popular RL environments and real-world applications. The results demonstrate that RICE significantly outperforms existing refining schemes in enhancing agent performance.
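The mixed initial state distribution reduces to a simple sampling rule: with some probability, reset to a critical state identified by the explanation method, otherwise use the default reset. A sketch, with the mixing probability as an assumed knob:

```python
# Hedged sketch of RICE's mixed-reset idea.
import random

def sample_initial_state(default_reset, critical_states, p_critical=0.5):
    """default_reset: callable returning a default initial state;
    critical_states: states flagged as pivotal by the explanation method."""
    if critical_states and random.random() < p_critical:
        return random.choice(critical_states)
    return default_reset()
```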
https://arxiv.org/abs/2405.03064
In safe Reinforcement Learning (RL), safety cost is typically defined as a function of the immediate state and actions. In practice, safety constraints can often be non-Markovian due to insufficient fidelity of the state representation, and the safety cost may not be known. We therefore address a general setting where safety labels (e.g., safe or unsafe) are associated with state-action trajectories. Our key contributions are: first, we design a safety model that performs credit assignment to assess the contributions of partial state-action trajectories to safety. This safety model is trained using a labeled safety dataset. Second, using an RL-as-inference strategy, we derive an effective algorithm for optimizing a safe policy using the learned safety model. Finally, we devise a method to dynamically adapt the tradeoff coefficient between reward maximization and safety compliance. We recast the constrained optimization problem as its dual problem and derive a gradient-based method to dynamically adjust the tradeoff coefficient during training. Our empirical results demonstrate that this approach is highly scalable and able to satisfy sophisticated non-Markovian safety constraints.
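The dynamic tradeoff adaptation the abstract describes amounts to projected gradient ascent on the dual variable of the constrained problem: the coefficient grows when the estimated safety cost exceeds its budget and shrinks otherwise. A minimal sketch (the step size and cost estimator are assumptions):

```python
# Sketch of the gradient-based dual update for the safety tradeoff coefficient.
def update_tradeoff_coef(lmbda, est_safety_cost, budget, lr=1e-3):
    """Projected gradient ascent on the Lagrange multiplier of the constraint
    E[safety cost] <= budget; the multiplier stays non-negative."""
    lmbda = lmbda + lr * (est_safety_cost - budget)
    return max(0.0, lmbda)
```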
https://arxiv.org/abs/2405.03005