The training, testing, and deployment of autonomous vehicles require realistic and efficient simulators. Moreover, because the problems posed by different autonomous systems vary widely, these simulators need to be easy to use and easy to modify. To address these problems we introduce TorchDriveSim and its benchmark extension TorchDriveEnv. TorchDriveEnv is a lightweight reinforcement learning benchmark programmed entirely in Python, which can be modified to test a number of different factors in learned vehicle behavior, including the effect of varying kinematic models, agent types, and traffic control patterns. Most importantly, unlike many replay-based simulation approaches, TorchDriveEnv is fully integrated with a state-of-the-art behavioral simulation API. This allows users to train and evaluate driving models alongside data-driven non-player characters (NPCs) whose initializations and driving behavior are reactive, realistic, and diverse. We illustrate the efficiency and simplicity of TorchDriveEnv by evaluating common reinforcement learning baselines in both training and validation environments. Our experiments show that TorchDriveEnv is easy to use, but difficult to solve.
https://arxiv.org/abs/2405.04491
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at this https URL.
https://arxiv.org/abs/2405.04434
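To make the MLA idea above concrete, here is a minimal linear-algebra sketch of latent KV caching. The matrices, dimensions, and helper names are illustrative assumptions, not DeepSeek-V2's actual parameterization: the point is only that caching one low-dimensional latent per token, then re-expanding it into keys and values at attention time, is what shrinks the KV cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 16, 4  # hypothetical dimensions; the ratio drives the savings

W_down = rng.normal(size=(d_latent, d_model))  # shared down-projection
W_uk = rng.normal(size=(d_model, d_latent))    # up-projection to keys
W_uv = rng.normal(size=(d_model, d_latent))    # up-projection to values

def cache_token(h):
    """Cache only the latent c for this token, instead of full K and V."""
    return W_down @ h

def expand(c):
    """Recover keys and values from the cached latent at attention time."""
    return W_uk @ c, W_uv @ c

h = rng.normal(size=d_model)  # a token's hidden state
c = cache_token(h)            # what actually goes into the KV cache
k, v = expand(c)              # reconstructed on demand
```

In the paper's formulation the up-projections can reportedly be absorbed into adjacent projections; they are kept explicit here for clarity.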
Recent developments in large language models (LLMs), while offering a powerful foundation for developing natural language agents, raise safety concerns about them and the autonomous agents built upon them. Of particular concern is deception, a capability we define as an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception as straight-out lying, making objectively selfish decisions, or giving false information, as seen in previous AI safety research, and target a specific category of deception achieved through obfuscation and equivocation. We explain the two types of deception by analogy with the rabbit-out-of-the-hat magic trick, where (i) the rabbit either comes out of a hidden trap door or (ii) (our focus) the audience is completely distracted while the magician brings out the rabbit right in front of them using sleight of hand or misdirection. Our novel testbed framework exposes the intrinsic deception capabilities of LLM agents in a goal-driven environment when they are directed to be deceptive in their natural language generations, using a two-agent adversarial dialogue system built upon the legislative task of "lobbying" for a bill. Within this goal-driven environment, we show how deceptive capacity develops through a reinforcement learning setup built around theories from the philosophy of language and cognitive psychology. We find that the lobbyist agent increases its deceptive capabilities by ~40% (relative) through subsequent reinforcement trials of adversarial interactions, and our deception detection mechanism shows a detection capability of up to 92%. Our results highlight potential issues in agent-human interaction, with agents potentially manipulating humans towards their programmed end-goals.
https://arxiv.org/abs/2405.04325
Offline reinforcement learning (RL) provides a promising approach to avoid costly online interaction with the real environment. However, the performance of offline RL depends heavily on the quality of the datasets, which may cause extrapolation error in the learning process. In many robotic applications, an inaccurate simulator is often available. However, data collected from the inaccurate simulator cannot be used directly in offline RL due to the well-known exploration-exploitation dilemma and the dynamics gap between the inaccurate simulation and the real environment. To address these issues, we propose a novel approach that better combines the offline dataset with the inaccurate simulation data. Specifically, we pre-train a generative adversarial network (GAN) model to fit the state distribution of the offline dataset. Given this, we collect data from the inaccurate simulator starting from the distribution provided by the generator and reweight the simulated data using the discriminator. Our experimental results on the D4RL benchmark and a real-world manipulation task confirm that our method benefits more from both the inaccurate simulator and the limited offline dataset, achieving better performance than state-of-the-art methods.
https://arxiv.org/abs/2405.04307
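The discriminator-based reweighting step can be sketched in a few lines. The helper below is a hypothetical illustration, assuming a discriminator trained to output d(s) ≈ p_offline(s) / (p_offline(s) + p_sim(s)); in that case d / (1 - d) estimates the density ratio used as an importance weight.

```python
import numpy as np

def importance_weights(d_scores, eps=1e-6):
    """Turn discriminator outputs d(s) in (0, 1) into importance weights.

    d / (1 - d) estimates the density ratio p_offline(s) / p_sim(s), so
    simulated states that resemble the offline data are up-weighted and
    off-distribution states are down-weighted.
    """
    d = np.clip(np.asarray(d_scores, dtype=float), eps, 1.0 - eps)
    w = d / (1.0 - d)
    return w / w.sum()  # normalize into a sampling distribution

# Three simulated states: near the offline distribution, ambiguous, far from it.
weights = importance_weights([0.9, 0.5, 0.1])
```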
One of the key challenges for current Reinforcement Learning (RL)-based Automated Driving (AD) agents is achieving flexible, precise, and human-like behavior cost-effectively. This paper introduces an innovative approach utilizing Large Language Models (LLMs) to intuitively and effectively optimize RL reward functions in a human-centric way. We developed a framework where instructions and dynamic environment descriptions are input into the LLM. The LLM then utilizes this information to assist in generating rewards, thereby steering the behavior of RL agents towards patterns that more closely resemble human driving. The experimental results demonstrate that this approach not only makes RL agents more anthropomorphic but also achieves better performance. Additionally, various strategies for reward-proxy and reward-shaping are investigated, revealing the significant impact of prompt design on shaping an AD vehicle's behavior. These findings offer a promising direction for the development of more advanced and human-like automated driving systems. Our experimental data and source code can be found here.
https://arxiv.org/abs/2405.04135
Recent advances in robot skill learning have unlocked the potential to construct task-agnostic skill libraries, facilitating the seamless sequencing of multiple simple manipulation primitives (a.k.a. skills) to tackle significantly more complex tasks. Nevertheless, determining the optimal sequence for independently learned skills remains an open problem, particularly when the objective is given solely in terms of the final geometric configuration rather than a symbolic goal. To address this challenge, we propose Logic-Skill Programming (LSP), an optimization-based approach that sequences independently learned skills to solve long-horizon tasks. We formulate a first-order extension of a mathematical program to optimize the overall cumulative reward of all skills within a plan, abstracted by the sum of value functions. To solve such programs, we leverage Tensor Trains to construct the value function space, and rely on alternations between symbolic search and skill value optimization to find the appropriate skill skeleton and optimal subgoal sequence. Experimental results indicate that the obtained value functions provide a superior approximation of cumulative rewards compared to state-of-the-art Reinforcement Learning methods. Furthermore, we validate LSP in three manipulation domains, encompassing both prehensile and non-prehensile primitives. The results demonstrate its capability to identify the optimal solution over the full logic and geometric path. The real-robot experiments showcase the effectiveness of our approach in coping with contact uncertainty and external disturbances in the real world.
https://arxiv.org/abs/2405.04082
Despite notable successes of Reinforcement Learning (RL), the prevalent use of an online learning paradigm prevents its widespread adoption, especially in hazardous or costly scenarios. Offline RL has emerged as an alternative solution, learning from pre-collected static datasets. However, this offline learning introduces a new challenge known as distributional shift, degrading the performance when the policy is evaluated on scenarios that are Out-Of-Distribution (OOD) from the training dataset. Most existing offline RL methods resolve this issue by regularizing policy learning to the information supported by the given dataset. However, such regularization overlooks the potential for high-reward regions that may exist beyond the dataset. This motivates exploring novel offline learning techniques that can make improvements beyond the data support without compromising policy performance, potentially by learning causation (cause and effect) instead of correlation from the dataset. In this paper, we propose the MOOD-CRL (Model-based Offline OOD-Adapting Causal RL) algorithm, which aims to address the challenge of extrapolation for offline policy training through causal inference instead of policy-regularizing methods. Specifically, a Causal Normalizing Flow (CNF) is developed to learn the transition and reward functions for data generation and augmentation in offline policy evaluation and training. Based on the data-invariant, physics-based qualitative causal graph and the observational data, we develop a novel learning scheme for CNF to learn the quantitative structural causal model. As a result, CNF gains predictive and counterfactual reasoning capabilities for sequential decision-making tasks, revealing a high potential for OOD adaptation. Our CNF-based offline RL approach is validated through empirical evaluations, outperforming model-free and model-based methods by a significant margin.
https://arxiv.org/abs/2405.03892
Adversarial machine learning, focused on studying various attacks and defenses on machine learning (ML) models, is rapidly gaining importance as ML is increasingly being adopted for optimizing wireless systems such as Open Radio Access Networks (O-RAN). Comprehensive modeling of the security threats, and the demonstration of adversarial attacks and defenses on practical AI-based O-RAN systems, is still in its nascent stages. We begin by conducting threat modeling to pinpoint attack surfaces in O-RAN, using an ML-based connection management application (xApp) as an example. The xApp uses a Graph Neural Network trained with Deep Reinforcement Learning and achieves an average 54% improvement in the coverage rate, measured as the 5th-percentile user data rate. We then formulate and demonstrate evasion attacks that degrade the coverage rates by as much as 50% by injecting bounded noise at different threat surfaces, including the open wireless medium itself. Crucially, we also compare and contrast the effectiveness of such attacks on the ML-based xApp and a non-ML-based heuristic. We finally develop and demonstrate robust training-based defenses against the challenging physical/jamming-based attacks and show a 15% improvement in the coverage rates when compared to employing no defense over a range of noise budgets.
https://arxiv.org/abs/2405.03891
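A generic sketch of bounded-noise injection of the kind described above; the helper below is an illustrative L-infinity projection, not the paper's specific attack construction.

```python
import numpy as np

def inject_bounded_noise(x, noise, epsilon):
    """Add an adversarial perturbation clipped to an L-infinity budget.

    x is a signal observed at some threat surface (e.g. measurements fed
    to the xApp); the attacker's noise is projected onto the epsilon-ball
    so the injection stays within its power constraint.
    """
    delta = np.clip(np.asarray(noise, dtype=float), -epsilon, epsilon)
    return np.asarray(x, dtype=float) + delta

# Components exceeding the budget are clipped; the rest pass through.
perturbed = inject_bounded_noise([1.0, -2.0, 0.5], noise=[0.3, -0.9, 0.1], epsilon=0.2)
```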
Undeniably, Large Language Models (LLMs) have stirred an extraordinary wave of innovation in the machine learning research domain, resulting in substantial impact across diverse fields such as reinforcement learning, robotics, and computer vision. Their incorporation has been rapid and transformative, marking a significant paradigm shift in the field of machine learning research. However, the field of experimental design, grounded on black-box optimization, has been much less affected by such a paradigm shift, even though integrating LLMs with optimization presents a unique landscape ripe for exploration. In this position paper, we frame the field of black-box optimization around sequence-based foundation models and organize their relationship with previous literature. We discuss the most promising ways foundational language models can revolutionize optimization, which include harnessing the vast wealth of information encapsulated in free-form text to enrich task comprehension, utilizing highly flexible sequence models such as Transformers to engineer superior optimization strategies, and enhancing performance prediction over previously unseen search spaces.
https://arxiv.org/abs/2405.03547
Reinforcement learning (RL) presents a promising framework to learn policies through environment interaction, but often requires an infeasible amount of interaction data to solve complex tasks from sparse rewards. One direction is to augment RL with offline data demonstrating the desired tasks, but past work often requires a large amount of high-quality demonstration data that is difficult to obtain, especially for domains such as robotics. Our approach consists of a reverse curriculum followed by a forward curriculum. Unique to our approach compared to past work is the ability to efficiently leverage more than one demonstration via a per-demonstration reverse curriculum generated via state resets. The result of our reverse curriculum is an initial policy that performs well on a narrow initial state distribution and helps overcome difficult exploration problems. A forward curriculum is then used to accelerate the training of the initial policy to perform well on the full initial state distribution of the task and improve demonstration and sample efficiency. We show how the combination of a reverse curriculum and forward curriculum in our method, RFCL, enables significant improvements in demonstration and sample efficiency compared against various state-of-the-art learning-from-demonstration baselines, even solving previously unsolvable tasks that require high precision and control.
https://arxiv.org/abs/2405.03379
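A minimal sketch of a per-demonstration reverse curriculum via state resets, under the assumption that the environment can be reset to arbitrary demonstration states; the helper names, threshold, and step size are hypothetical.

```python
import random

def sample_reset_state(demo, start_idx):
    """Reset an episode to a state drawn from the tail of one demonstration.

    start_idx is the earliest demo step the curriculum currently allows:
    training begins near the goal and the window moves earlier over time.
    """
    t = random.randint(start_idx, len(demo) - 1)
    return demo[t]

def advance_curriculum(start_idx, success_rate, threshold=0.8, step=1):
    """Move the reset point one step earlier once the policy succeeds reliably."""
    if success_rate >= threshold:
        start_idx = max(0, start_idx - step)
    return start_idx

demo = ["s0", "s1", "s2", "s3", "s4"]  # states recorded along one demonstration
idx = advance_curriculum(start_idx=4, success_rate=0.9)  # reliable from s4, back up
state = sample_reset_state(demo, idx)
```

Each demonstration keeps its own start_idx, which is what makes the curriculum per-demonstration.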
Q-learning excels at learning from feedback in sequential decision-making tasks but requires extensive sampling to achieve significant improvements. Although reward shaping is a powerful technique for enhancing learning efficiency, it can introduce biases that affect agent performance. Furthermore, potential-based reward shaping is constrained in that it does not allow reward modifications based on actions or terminal states, potentially limiting its effectiveness in complex environments. Additionally, large language models (LLMs) can achieve zero-shot learning, but this is generally limited to simpler tasks. They also exhibit low inference speeds and occasionally produce hallucinations. To address these issues, we propose LLM-guided Q-learning, which employs LLMs as a heuristic to aid in learning the Q-function for reinforcement learning. It combines the advantages of both technologies without introducing performance bias. Our theoretical analysis demonstrates that the LLM heuristic provides action-level guidance. Additionally, our architecture can convert the impact of hallucinations into exploration costs. Moreover, the converged Q-function corresponds to the MDP-optimal Q-function. Experimental results demonstrate that our algorithm enables agents to avoid ineffective exploration, enhances sampling efficiency, and is well-suited for complex control tasks.
https://arxiv.org/abs/2405.03341
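One way to read "hallucinations become exploration costs" is that the LLM biases action selection while the Q-update itself stays standard, so a hallucinated suggestion wastes samples without moving the fixed point. The sketch below is a hedged illustration of that reading, not the paper's algorithm; llm_heuristic is a hard-coded stand-in for an actual LLM query.

```python
import random
from collections import defaultdict

Q = defaultdict(float)            # tabular Q-function, defaults to 0
alpha, gamma, eps = 0.5, 0.9, 0.1

def llm_heuristic(state, actions):
    """Hypothetical stand-in for an LLM prior over actions.

    A real system would query (and cache) an LLM; a hard-coded preference
    keeps the sketch self-contained.
    """
    return actions[0]

def select_action(state, actions):
    # The heuristic only breaks ties during action selection, so a
    # hallucinated suggestion costs exploration steps but cannot bias
    # the fixed point of the standard Q-update below.
    if random.random() < eps:
        return random.choice(actions)
    hint = llm_heuristic(state, actions)
    return max(actions, key=lambda a: (Q[(state, a)], a == hint))

def q_update(s, a, r, s_next, actions):
    """Standard Q-learning update, unchanged by the heuristic."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = ["left", "right"]
q_update("s0", "right", 1.0, "s1", actions)  # Q("s0","right") -> 0.5
chosen = select_action("s0", actions)
```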
Purpose: Autonomous navigation of devices in endovascular interventions can decrease operation times, improve decision-making during surgery, and reduce operator radiation exposure while increasing access to treatment. This systematic review explores recent literature to assess the impact, challenges, and opportunities artificial intelligence (AI) has for autonomous endovascular intervention navigation. Methods: The PubMed and IEEE Xplore databases were queried. Eligibility criteria included studies investigating the use of AI in enabling the autonomous navigation of catheters/guidewires in endovascular interventions. Following PRISMA, articles were assessed using QUADAS-2. PROSPERO: CRD42023392259. Results: Among 462 studies, fourteen met the inclusion criteria. Reinforcement learning (9/14, 64%) and learning from demonstration (7/14, 50%) were used as data-driven models for autonomous navigation. Studies predominantly utilised physical phantoms (10/14, 71%) and in silico (4/14, 29%) models. Experiments within or around the blood vessels of the heart were reported by the majority of studies (10/14, 71%), while simple non-anatomical vessel platforms were used in three studies (3/14, 21%), and the porcine liver venous system in one study. We observed that risk of bias and poor generalisability were present across studies. No procedures were performed on patients in any of the studies reviewed. Studies lacked patient selection criteria, reference standards, and reproducibility, resulting in low clinical evidence levels. Conclusions: AI's potential in autonomous endovascular navigation is promising, but it remains at an experimental proof-of-concept stage, with a technology readiness level of 3. We highlight that reference standards with well-identified performance metrics are crucial to allow for comparisons of the data-driven algorithms proposed in the years to come.
https://arxiv.org/abs/2405.03305
In the course of the energy transition, generation and consumption will expand and change, and many new technologies, such as PV systems, electric cars and heat pumps, will influence the power flow, especially in the distribution grids. Scalable methods that can make decisions for each grid connection are needed to enable congestion-free grid operation in the distribution grids. This paper presents a novel end-to-end approach to resolving congestion in distribution grids with deep reinforcement learning. Our architecture learns to curtail power and set appropriate reactive power to determine a non-congested and, thus, feasible grid state. State-of-the-art methods such as the optimal power flow (OPF) demand high computational costs and detailed measurements of every bus in a grid. In contrast, the presented method enables decisions under sparse information, with just some buses observable in the grid. Distribution grids are generally not yet fully digitized and observable, so this method can be used for decision-making on the majority of low-voltage grids. On a real low-voltage grid the approach resolves 100% of violations in the voltage band and 98.8% of asset overloads. The results show that decisions can also be made on real grids that guarantee sufficient quality for congestion-free grid operation.
https://arxiv.org/abs/2405.03262
Reinforcement Learning is a promising tool for learning complex policies even in fast-moving and object-interactive domains where human teleoperation or hard-coded policies might fail. To effectively reflect this challenging category of tasks, we introduce a dynamic, interactive RL testbed based on robot air hockey. By augmenting air hockey with a large family of tasks ranging from easy ones like reaching to challenging ones like pushing a block by hitting it with a puck, as well as goal-based and human-interactive tasks, our testbed allows a varied assessment of RL capabilities. The robot air hockey testbed also supports sim-to-real transfer with three domains: two simulators of increasing fidelity and a real robot system. Using a dataset of demonstration data gathered through two teleoperation systems, a virtualized control environment and human shadowing, we assess the testbed with behavior cloning, offline RL, and RL from scratch.
https://arxiv.org/abs/2405.03113
Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially with sparse rewards, remains a significant challenge. The training of a DRL agent can often be trapped in a bottleneck without further progress. In this paper, we propose RICE, an innovative refining scheme for reinforcement learning that incorporates explanation methods to break through the training bottlenecks. The high-level idea of RICE is to construct a new initial state distribution that combines both the default initial states and critical states identified through explanation methods, thereby encouraging the agent to explore from the mixed initial states. Through careful design, we can theoretically guarantee that our refining scheme has a tighter sub-optimality bound. We evaluate RICE in various popular RL environments and real-world applications. The results demonstrate that RICE significantly outperforms existing refining schemes in enhancing agent performance.
https://arxiv.org/abs/2405.03064
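The mixed initial state distribution at the core of RICE can be sketched directly; the mixing probability and the state lists below are hypothetical placeholders, and identifying the critical states (via explanation methods) is the part this sketch leaves out.

```python
import random

def sample_initial_state(default_states, critical_states, p_critical=0.5):
    """Draw an episode's initial state from the mixed distribution.

    With probability p_critical the agent restarts from a critical state
    (one identified by an explanation method as pivotal to past outcomes);
    otherwise it restarts from the environment's default initial states.
    """
    pool = critical_states if random.random() < p_critical else default_states
    return random.choice(pool)

s0 = sample_initial_state(["start"], ["bottleneck_a", "bottleneck_b"])
```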
In safe Reinforcement Learning (RL), safety cost is typically defined as a function of the immediate state and actions. In practice, safety constraints can often be non-Markovian due to insufficient fidelity of the state representation, and the safety cost may not be known. We therefore address a general setting where safety labels (e.g., safe or unsafe) are associated with state-action trajectories. Our key contributions are: first, we design a safety model that performs credit assignment to assess the contributions of partial state-action trajectories to safety. This safety model is trained using a labeled safety dataset. Second, using an RL-as-inference strategy, we derive an effective algorithm for optimizing a safe policy using the learned safety model. Finally, we devise a method to dynamically adapt the tradeoff coefficient between reward maximization and safety compliance. We rewrite the constrained optimization problem as its dual problem and derive a gradient-based method to dynamically adjust the tradeoff coefficient during training. Our empirical results demonstrate that this approach is highly scalable and able to satisfy sophisticated non-Markovian safety constraints.
https://arxiv.org/abs/2405.03005
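The gradient-based tradeoff adaptation can be sketched as a projected ascent step on the dual variable; the learning rate, violation estimate, and budget below are illustrative assumptions, not the paper's values.

```python
def update_tradeoff(lmbda, safety_violation_rate, budget, lr=0.05):
    """One gradient step on the dual variable of the constrained problem.

    The Lagrange multiplier lmbda weights safety against reward: it grows
    while the learned safety model reports more violations than the budget
    allows, and shrinks (never below zero) once the policy is compliant.
    """
    return max(0.0, lmbda + lr * (safety_violation_rate - budget))

lmbda = 1.0
lmbda = update_tradeoff(lmbda, safety_violation_rate=0.3, budget=0.1)  # violated: lmbda rises
lmbda = update_tradeoff(lmbda, safety_violation_rate=0.0, budget=0.1)  # compliant: lmbda decays
```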
Currently, generative models have garnered considerable attention for their ability to address the scarcity of abnormal samples in the industrial Internet of Things (IoT). However, challenges persist regarding the edge deployment of generative models and the optimization of joint edge AI-generated content (AIGC) tasks. In this paper, we focus on the edge optimization of AIGC task execution and propose GMEL, a generative model-driven industrial AIGC collaborative edge learning framework. This framework aims to facilitate efficient few-shot learning by leveraging realistic sample synthesis and edge-based optimization capabilities. First, a multi-task AIGC computational offloading model is presented to ensure the efficient execution of heterogeneous AIGC tasks on edge servers. Then, we propose an attention-enhanced multi-agent reinforcement learning (AMARL) algorithm aimed at refining offloading policies within the IoT system, thereby supporting generative model-driven edge learning. Finally, our experimental results demonstrate the effectiveness of the proposed algorithm in optimizing the total system latency of edge-based AIGC task completion.
https://arxiv.org/abs/2405.02972
Exploring complex adaptive financial trading environments through multi-agent based simulation methods presents an innovative approach within the realm of quantitative finance. Despite the dominance of multi-agent reinforcement learning approaches in financial markets with observable data, there exists a set of systemically significant financial markets that pose challenges due to their partial or obscured data availability. We therefore devise a multi-agent simulation approach employing small-scale meta-heuristic methods. This approach aims to represent the opaque bilateral market for Australian government bond trading, capturing the bilateral nature of bank-to-bank trading, also referred to as "over-the-counter" (OTC) trading, which commonly occurs between "market makers". The uniqueness of the bilateral market, characterized by negotiated transactions and a limited number of agents, yields valuable insights for agent-based modelling and quantitative finance. The inherent rigidity of this market structure, which is at odds with the global proliferation of multilateral platforms and the decentralization of finance, underscores the unique insights offered by our agent-based model. We explore the implications of market rigidity on market structure and consider the element of stability in market design. This extends the ongoing discourse on complex financial trading environments, providing an enhanced understanding of their dynamics and implications.
https://arxiv.org/abs/2405.02849
Deep reinforcement learning (DRL) has demonstrated remarkable performance in many continuous control tasks. However, a significant obstacle to the real-world application of DRL is the lack of safety guarantees. Although DRL agents can satisfy system safety in expectation through reward shaping, designing agents to consistently meet hard constraints (e.g., safety specifications) at every time step remains a formidable challenge. In contrast, existing work in the field of safe control provides guarantees on persistent satisfaction of hard safety constraints. However, these methods require explicit analytical system dynamics models to synthesize safe control, which are typically inaccessible in DRL settings. In this paper, we present a model-free safe control algorithm, the implicit safe set algorithm, for synthesizing safeguards for DRL agents that ensure provable safety throughout training. The proposed algorithm synthesizes a safety index (barrier certificate) and a subsequent safe control law solely by querying a black-box dynamic function (e.g., a digital twin simulator). Moreover, we theoretically prove that the implicit safe set algorithm guarantees finite time convergence to the safe set and forward invariance for both continuous-time and discrete-time systems. We validate the proposed algorithm on the state-of-the-art Safety Gym benchmark, where it achieves zero safety violations while gaining 95% ± 9% cumulative reward compared to state-of-the-art safe DRL methods. Furthermore, the resulting algorithm scales well to high-dimensional systems with parallel computing.
https://arxiv.org/abs/2405.02754
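A toy illustration of filtering actions through a safety index using only black-box dynamics queries; the 1-D dynamics, the candidate-action fallback, and the index phi below are hypothetical simplifications of the paper's synthesis procedure, shown only to make the safeguard pattern concrete.

```python
def safe_control(state, nominal_action, candidate_actions, phi, step):
    """Filter a nominal action through a safety-index safeguard.

    phi(state) is the safety index (a barrier-style certificate): phi <= 0
    marks the safe set. step(state, action) is the black-box dynamics query
    (e.g. a digital-twin simulator). If the nominal action would leave the
    safe set, fall back to the candidate that decreases phi the most,
    driving the system back toward the safe set.
    """
    next_state = step(state, nominal_action)
    if phi(next_state) <= 0:
        return nominal_action
    return min(candidate_actions, key=lambda a: phi(step(state, a)))

# Toy 1-D example: state x, safe set {x <= 1}, dynamics x' = x + a.
phi = lambda x: x - 1.0
step = lambda x, a: x + a
chosen = safe_control(0.9, nominal_action=0.5,
                      candidate_actions=[0.5, 0.0, -0.5], phi=phi, step=step)
```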
Hand-object manipulation is an important interaction motion in our daily activities. We faithfully reconstruct this motion from a single RGBD camera using a novel deep reinforcement learning method that leverages physics. First, we propose object compensation control, which establishes direct object control to make network training more stable. Meanwhile, by leveraging the compensation force and torque, we seamlessly upgrade the simple point-contact model to a more physically plausible surface-contact model, further improving reconstruction accuracy and physical correctness. Experiments indicate that, without involving any heuristic physical rules, this work still successfully incorporates physics into the reconstruction of hand-object interactions, which are complex motions hard to imitate with deep reinforcement learning. Our code and data are available at this https URL.
https://arxiv.org/abs/2405.02676