There has been a growing interest in developing learner models to enhance learning and teaching experiences in educational environments. However, existing work has primarily focused on structured environments that rely on meticulously crafted task representations, thereby limiting the agent's ability to generalize skills across tasks. In this paper, we aim to enhance the generalization capabilities of agents in open-ended text-based learning environments by integrating Reinforcement Learning (RL) with Large Language Models (LLMs). We investigate three types of agents: (i) RL-based agents that utilize natural language for state and action representations to find the best interaction strategy, (ii) LLM-based agents that leverage the model's general knowledge and reasoning through prompting, and (iii) hybrid LLM-assisted RL agents that combine these two strategies to improve agents' performance and generalization. To support the development and evaluation of these agents, we introduce PharmaSimText, a novel benchmark derived from the PharmaSim virtual pharmacy environment designed for practicing diagnostic conversations. Our results show that RL-based agents excel at task completion but struggle to ask high-quality diagnostic questions. In contrast, LLM-based agents perform better at asking diagnostic questions but fall short of completing the task. Finally, hybrid LLM-assisted RL agents enable us to overcome these limitations, highlighting the potential of combining RL and LLMs to develop high-performing agents for open-ended learning environments.
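As a hedged illustration of the hybrid LLM-assisted RL idea (our reading of the abstract, not the paper's implementation), the sketch below lets a hypothetical LLM call prune the text-based action space while a learned tabular Q-function makes the final epsilon-greedy choice; `llm_propose_actions`, `q_value`, and `q_table` are illustrative stand-ins.

```python
import random

def llm_propose_actions(state_text: str, actions: list[str], k: int = 3) -> list[str]:
    """Hypothetical stand-in for an LLM call that returns the k most
    clinically relevant questions/actions for the current dialogue state."""
    return actions[:k]  # placeholder ranking

def q_value(q_table: dict, state_text: str, action: str) -> float:
    """Tabular value estimate learned by the RL side of the hybrid agent."""
    return q_table.get((state_text, action), 0.0)

def act(state_text: str, actions: list[str], q_table: dict, eps: float = 0.1) -> str:
    """Epsilon-greedy action choice restricted to the LLM-filtered candidates."""
    candidates = llm_propose_actions(state_text, actions)
    if random.random() < eps:
        return random.choice(candidates)  # explore within the pruned set
    return max(candidates, key=lambda a: q_value(q_table, state_text, a))
```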
https://arxiv.org/abs/2404.18978
Deep reinforcement learning (DRL) has shown remarkable success in simulation domains, yet its application in designing robot controllers remains limited due to its single-task orientation and insufficient adaptability to environmental changes. To overcome these limitations, we present a novel adaptive agent that leverages transfer learning techniques to dynamically adapt its policy to different tasks and environmental conditions. The approach is validated through the blimp control challenge, where multitasking capabilities and environmental adaptability are essential. The agent is trained using a custom, highly parallelized simulator built on IsaacGym. We perform zero-shot transfer to fly the blimp in the real world, solving various tasks. We share our code at \url{this https URL\_agent/}.
https://arxiv.org/abs/2404.18713
The natural interaction between robots and pedestrians during autonomous navigation is crucial for the intelligent development of mobile robots, requiring robots to fully consider social rules and guarantee the psychological comfort of pedestrians. Among existing results in robotic path planning, learning-based socially adaptive algorithms have performed well in some specific human-robot interaction environments. However, human-robot interaction scenarios are diverse and constantly changing in daily life, and the generalization of socially adaptive robot path planning remains to be further investigated. To address this issue, this work proposes a new socially adaptive path planning algorithm that combines a generative adversarial network (GAN) with the Optimal Rapidly-exploring Random Tree (RRT*) navigation algorithm. Firstly, a GAN model with strong generalization performance is proposed to adapt the navigation algorithm to more scenarios. Secondly, a GAN-based Optimal Rapidly-exploring Random Tree navigation algorithm (GAN-RRT*) is proposed to generate paths in human-robot interaction environments. Finally, we propose a socially adaptive path planning framework named GAN-RTIRL, which combines the GAN model with Rapidly-exploring Random Trees Inverse Reinforcement Learning (RTIRL) to improve the homotopy rate between planned and demonstration paths. In the GAN-RTIRL framework, the GAN-RRT* path planner can update the GAN model from the demonstration path. In this way, the robot can generate more anthropomorphic paths in human-robot interaction environments and generalizes better to more complex environments. Experimental results reveal that our proposed method can effectively improve the anthropomorphic degree of robot motion planning and the homotopy rate between planned and demonstration paths.
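Below is a minimal sketch, with hypothetical interfaces, of the core GAN-RRT* idea as we read it: tree growth is biased toward waypoints proposed by a GAN trained on demonstration paths, falling back to classic uniform RRT* sampling; `gan_generator`, `scene`, and `p_gan` are illustrative assumptions.

```python
import random

def sample_point(gan_generator, scene, bounds, p_gan: float = 0.7):
    """Mix GAN-proposed, socially plausible waypoints with uniform samples.

    gan_generator: hypothetical model mapping a scene encoding to a waypoint;
    bounds: [(lo, hi), ...] per coordinate for the uniform fallback."""
    if random.random() < p_gan:
        return gan_generator(scene)  # learned, human-like sample
    return tuple(random.uniform(lo, hi) for lo, hi in bounds)  # classic RRT* sample
```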
https://arxiv.org/abs/2404.18687
Large Language Models (LLMs) encapsulate an extensive amount of world knowledge, and this has enabled their application in various domains to improve the performance of a variety of Natural Language Processing (NLP) tasks. This has also facilitated a more accessible paradigm of conversation-based interactions between humans and AI systems to solve intended problems. However, one interesting avenue that shows untapped potential is the use of LLMs as Reinforcement Learning (RL) agents to enable conversational RL problem solving. Therefore, in this study, we explore the concept of formulating Markov Decision Process-based RL problems as LLM prompting tasks. We demonstrate how LLMs can be iteratively prompted to learn and optimize policies for specific RL tasks. In addition, we leverage the introduced prompting technique for episode simulation and Q-Learning, facilitated by LLMs. We then show the practicality of our approach through two detailed case studies for "Research Scientist" and "Legal Matter Intake" workflows.
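As a concrete (and deliberately simplified) illustration of the prompting-as-RL idea, the sketch below runs tabular Q-Learning where the environment transition is a hypothetical LLM call: the MDP description, state, and action would be rendered into a prompt and the reply parsed into (next state, reward, done). The stub `llm_simulate_step` is our assumption, not the paper's API.

```python
import random
from collections import defaultdict

def llm_simulate_step(state, action):
    """Hypothetical: prompt an LLM with the MDP description, current state,
    and action, then parse (next_state, reward, done) from its reply."""
    return state, 0.0, True  # placeholder transition

def q_learning(states, actions, episodes=100, alpha=0.1, gamma=0.95, eps=0.2):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = random.choice(states), False
        while not done:
            a = random.choice(actions) if random.random() < eps else \
                max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = llm_simulate_step(s, a)
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # temporal-difference update
            s = s2
    return Q
```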
https://arxiv.org/abs/2404.18638
Accurately simulating diverse behaviors of heterogeneous agents in various scenarios is fundamental to autonomous driving simulation. This task is challenging due to the multi-modality of behavior distribution, the high-dimensionality of driving scenarios, distribution shift, and incomplete information. Our first insight is to leverage state-matching through differentiable simulation to provide meaningful learning signals and achieve efficient credit assignment for the policy. This is demonstrated by revealing the existence of gradient highways and inter-agent gradient pathways. However, the issues of gradient explosion and weak supervision in low-density regions are discovered. Our second insight is that these issues can be addressed by applying dual policy regularizations to narrow the function space. Further considering diversity, our third insight is that the behaviors of heterogeneous agents in the dataset can be effectively compressed as a series of prototype vectors for retrieval. These lead to our model-based reinforcement-imitation learning framework with temporally abstracted mixture-of-codebooks (MRIC). MRIC introduces the open-loop model-based imitation learning regularization to stabilize training, and model-based reinforcement learning (RL) regularization to inject domain knowledge. The RL regularization involves differentiable Minkowski-difference-based collision avoidance and projection-based on-road and traffic rule compliance rewards. A dynamic multiplier mechanism is further proposed to eliminate the interference from the regularizations while ensuring their effectiveness. Experimental results using the large-scale Waymo open motion dataset show that MRIC outperforms state-of-the-art baselines on diversity, behavioral realism, and distributional realism, with large margins on some key metrics (e.g., collision rate, minSADE, and time-to-collision JSD).
https://arxiv.org/abs/2404.18464
While Deep Reinforcement Learning (DRL) has emerged as a promising solution for intricate control tasks, the lack of explainability of the learned policies impedes its uptake in safety-critical applications, such as automated driving systems (ADS). Counterfactual (CF) explanations have recently gained prominence for their ability to interpret black-box Deep Learning (DL) models. CF examples are associated with minimal changes in the input, resulting in a complementary output by the DL model. Finding such alterations, particularly for high-dimensional visual inputs, poses significant challenges. Besides, the temporal dependency introduced by the DRL agent's reliance on a history of past state observations further complicates the generation of CF examples. To address these challenges, we propose using a saliency map to identify the most influential input pixels across the sequence of past states observed by the agent. Then, we feed this map to a deep generative model, enabling the generation of plausible CFs with constrained modifications centred on the salient regions. We evaluate the effectiveness of our framework in diverse domains, including ADS and the Atari Pong, Pacman, and Space Invaders games, using traditional performance metrics such as validity, proximity, and sparsity. Experimental results demonstrate that this framework generates more informative and plausible CFs than the state-of-the-art for a wide range of environments and DRL agents. In order to foster research in this area, we have made our datasets and codes publicly available at this https URL.
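The pipeline reads as: compute a saliency map over the observed frame history, keep only the most influential pixels as an editable mask, and let a deep generative model modify just those regions. A minimal sketch under that reading (the `generator` argument and the quantile threshold are our assumptions):

```python
import numpy as np

def counterfactual(frames: np.ndarray, saliency: np.ndarray,
                   generator, threshold: float = 0.8) -> np.ndarray:
    """frames, saliency: arrays of shape [T, H, W]; generator: hypothetical
    model mapping (frames, mask) to an in-distribution edited sequence."""
    mask = saliency >= np.quantile(saliency, threshold)  # most influential pixels
    edited = generator(frames, mask)                     # edits constrained to the mask
    return np.where(mask, edited, frames)                # non-salient pixels unchanged
```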
https://arxiv.org/abs/2404.18326
Recent work on decentralized computational trust models for open Multi-Agent Systems has resulted in the development of CA, a biologically inspired model which focuses on the trustee's perspective. This new model addresses a serious unresolved problem in existing trust and reputation models, namely the inability to handle constantly changing behaviors and agents' continuous entry into and exit from the system. In previous work, we compared CA to FIRE, a well-known trust and reputation model, and found that CA is superior when the trustor population changes, whereas FIRE is more resilient to changes in the trustee population. Thus, in this paper, we investigate how trustors can detect the presence of several dynamic factors in their environment and then decide which trust model to employ in order to maximize utility. We frame this problem as a machine learning problem in a partially observable environment, where the presence of several dynamic factors is not known to the trustor, and we describe how an adaptable trustor can rely on a few measurable features to assess the current state of the environment and then use Deep Q Learning (DQN), in a single-agent Reinforcement Learning setting, to learn how to adapt to a changing environment. We ran a series of simulation experiments to compare the performance of the adaptable trustor with that of trustors using only one model (FIRE or CA), and we show that an adaptable agent is indeed capable of learning when to use each model and, thus, of performing consistently in dynamic environments.
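A minimal sketch, assuming a small set of measurable environment features and an action space of two trust models ({FIRE, CA}), of the adaptable trustor's Q-network; the layer sizes and feature count are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class TrustModelSelector(nn.Module):
    """DQN-style Q-network: measurable environment features -> Q per trust model."""
    def __init__(self, n_features: int = 4, n_models: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, n_models),  # one Q-value per model: 0 -> FIRE, 1 -> CA
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

selector = TrustModelSelector()
features = torch.tensor([0.3, 0.7, 0.1, 0.9])   # hypothetical measured features
model_idx = selector(features).argmax().item()  # greedy choice of trust model
```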
https://arxiv.org/abs/2404.18296
Emerging data-driven approaches, such as deep reinforcement learning (DRL), aim at on-the-field learning of powertrain control policies that optimize fuel economy and other performance metrics. Indeed, they have shown great potential in this regard for individual vehicles on specific routes or drive cycles. However, for fleets of vehicles that must service a distribution of routes, DRL approaches struggle with learning stability issues that result in high variance and challenge their practical deployment. In this paper, we present a novel framework for shared learning among a fleet of vehicles through the use of a distilled group policy as the knowledge-sharing mechanism for the policy learning computations at each vehicle. We detail the mathematical formulation that makes this possible. Several scenarios are considered to analyze the functionality, performance, and computational scalability of the framework with fleet size. Comparisons of the cumulative performance of fleets using our proposed shared learning approach with a baseline of individual learning agents and another state-of-the-art approach with a centralized learner show clear advantages for our approach. For example, we find a fleet-average asymptotic improvement of 8.5 percent in fuel economy compared to the baseline, while also improving on the metrics of acceleration error and shifting frequency for fleets serving a distribution of suburban routes. Furthermore, we include demonstrative results that show how the framework reduces variance within a fleet and also how it helps individual agents adapt better to new routes.
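One plausible reading of the distilled group policy (a sketch under our assumptions, not the paper's formulation) is a distillation objective that pulls the shared policy toward the average of the fleet's action distributions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(group_logits: torch.Tensor,
                      vehicle_logits_list: list[torch.Tensor]) -> torch.Tensor:
    """group_logits: [B, A] logits of the shared group policy;
    vehicle_logits_list: per-vehicle [B, A] logits on the same states."""
    with torch.no_grad():  # the fleet acts as a fixed teacher ensemble
        teacher = torch.stack(
            [F.softmax(l, dim=-1) for l in vehicle_logits_list]).mean(dim=0)
    return F.kl_div(F.log_softmax(group_logits, dim=-1), teacher,
                    reduction="batchmean")
```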
https://arxiv.org/abs/2404.17892
In recent years, multi-agent reinforcement learning algorithms have made significant advancements in diverse gaming environments, leading to increased interest in the broader application of such techniques. To address the prevalent challenge of partial observability, communication-based algorithms have improved cooperative performance through the sharing of numerical embeddings between agents. However, the understanding of how collaborative mechanisms form is still very limited, making the design of a human-understandable communication mechanism a valuable problem to address. In this paper, we propose a novel multi-agent reinforcement learning algorithm that embeds large language models into agents, endowing them with the ability to generate human-understandable verbal communication. The entire framework has a message module and an action module. The message module is responsible for generating and sending verbal messages to other agents, effectively enhancing information sharing among agents. To further enhance the message module, we employ a teacher model to generate message labels from the global view and update the student model through Supervised Fine-Tuning (SFT). The action module receives messages from other agents and selects actions based on current local observations and received messages. Experiments conducted on the Overcooked game demonstrate that our method significantly enhances the learning efficiency and performance of existing methods, while also providing an interpretable tool for humans to understand the process of multi-agent cooperation.
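A minimal sketch of the two-module loop as described (all interfaces hypothetical): the message module turns a local observation into a short verbal message via an LLM call, and the action module conditions on the local observation plus the inbox; the teacher model would supply message labels from the global view for SFT of the student message module.

```python
def message_module(llm, local_obs: str) -> str:
    """Hypothetical LLM call producing a short verbal message for teammates."""
    return llm(f"Observation: {local_obs}\nWrite one short message to your teammates:")

def action_module(policy, local_obs: str, inbox: list[str]) -> int:
    """Pick an action from the local observation plus received messages."""
    return policy(local_obs, " | ".join(inbox))

# Placeholder callables so the sketch runs end to end:
llm = lambda prompt: "Plating the soup; please chop onions."
policy = lambda obs, msgs: 0  # index into the discrete action set
msg = message_module(llm, "holding soup at the counter")
action = action_module(policy, "holding soup at the counter", [msg])
```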
https://arxiv.org/abs/2404.17780
Reinforcement Learning (RL) provides a framework in which agents can be trained, via trial and error, to solve complex decision-making problems. Learning with little supervision causes RL methods to require large amounts of data, which renders them too expensive for many applications (e.g. robotics). By reusing knowledge from a different task, knowledge transfer methods present an alternative to reduce the training time in RL. Given how severe data scarcity can be, there has been a growing interest in methods capable of transferring knowledge across different domains (i.e. problems with different representations) due to the flexibility they offer. This review presents a unifying analysis of methods focused on transferring knowledge across different domains. Through a taxonomy based on a transfer-approach categorization, and a characterization of works based on their data-assumption requirements, the objectives of this article are to 1) provide a comprehensive and systematic review of knowledge transfer methods for the cross-domain RL setting, 2) categorize and characterize these methods to provide an analysis based on relevant features such as their transfer approach and data requirements, and 3) discuss the main challenges regarding cross-domain knowledge transfer, as well as ideas for future directions worth exploring to address these problems.
https://arxiv.org/abs/2404.17687
Furniture assembly remains an unsolved problem in robotic manipulation due to its long task horizon and non-generalizable operation plans. This paper presents the Tactile Ensemble Skill Transfer (TEST) framework, a pioneering offline reinforcement learning (RL) approach that incorporates tactile feedback in the control loop. TEST's core design is to learn a skill transition model for high-level planning, along with a set of adaptive intra-skill goal-reaching policies. Such a design aims to solve the robotic furniture assembly problem in a more generalizable way, facilitating seamless chaining of skills for this long-horizon task. We first sample demonstrations from a set of heuristic policies, with trajectories consisting of randomized sub-skill segments, enabling the acquisition of rich robot trajectories that capture skill stages, robot states, visual indicators, and, crucially, tactile signals. Leveraging these trajectories, our offline RL method discerns skill termination conditions and coordinates skill transitions. Our evaluations highlight the proficiency of TEST on in-distribution furniture assemblies, its adaptability to unseen furniture configurations, and its robustness against visual disturbances. Ablation studies further accentuate the pivotal role of two algorithmic components: the skill transition model and the tactile ensemble policies. Results indicate that TEST can achieve a success rate of 90% and is over 4 times more efficient than the heuristic policy in both in-distribution and generalization settings, suggesting a scalable skill transfer approach for contact-rich manipulation.
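As a rough sketch of how such skill chaining could look (our reading of the abstract; `skill_model`, `policies`, and `env` are hypothetical interfaces): a high-level transition model proposes the next sub-skill, and an intra-skill goal-reaching policy runs it until the learned termination condition fires.

```python
def assemble(obs, skill_model, policies, env, max_skills: int = 10) -> bool:
    """Chain sub-skills until the assembly goal is reached or the budget runs out."""
    skill = skill_model.initial_skill(obs)
    for _ in range(max_skills):
        policy = policies[skill]                       # intra-skill goal-reaching policy
        while not skill_model.terminated(skill, obs):  # learned termination condition
            obs = env.step(policy(obs))                # obs includes tactile signals
        if skill_model.is_goal(obs):
            return True
        skill = skill_model.next_skill(skill, obs)     # coordinate the skill transition
    return False
```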
https://arxiv.org/abs/2404.17684
Automating the segregation process is a need for every sector that experiences high-volume materials handling, repetitive and exhausting operations, and risky exposures. Automated pick-and-place operations can be learned efficiently by introducing collaborative autonomous systems (e.g. manipulators) into the workplace and among human operators. In this paper, we propose a deep reinforcement learning strategy to learn the placing task for multi-categorical items, from a workspace shared between dual manipulators to multi-goal destinations, assuming the pick has already been completed. The learning strategy leverages, first, a stochastic actor-critic framework to train the agent's policy network, and, second, a dynamic 3D Gym environment in which both static and dynamic obstacles (e.g. human factors and the robot mate) constitute the state space of a Markov decision process. Learning is conducted in a Gazebo simulator, and experiments show an increase in cumulative reward for the agent that stays farther away from human factors. Future investigations will be conducted to enhance the task performance for both agents simultaneously.
https://arxiv.org/abs/2404.17673
Precise object manipulation and placement is a common problem for household robots, surgery robots, and robots working on in-situ construction. Prior work using computer vision, depth sensors, and reinforcement learning lacks the ability to reactively recover from planning errors, execution errors, or sensor noise. This work introduces a method that uses force-torque sensing to robustly place objects in stable poses, even in adversarial environments. Across 46 trials, our method achieves success rates of 100% for basic stacking and 17% for cases requiring adjustment.
https://arxiv.org/abs/2404.17668
Numerous capability and safety techniques of Large Language Models (LLMs), including RLHF, automated red-teaming, prompt engineering, and infilling, can be cast as sampling from an unnormalized target distribution defined by a given reward or potential function over the full sequence. In this work, we leverage the rich toolkit of Sequential Monte Carlo (SMC) for these probabilistic inference problems. In particular, we use learned twist functions to estimate the expected future value of the potential at each timestep, which enables us to focus inference-time computation on promising partial sequences. We propose a novel contrastive method for learning the twist functions, and establish connections with the rich literature of soft reinforcement learning. As a complementary application of our twisted SMC framework, we present methods for evaluating the accuracy of language model inference techniques using novel bidirectional SMC bounds on the log partition function. These bounds can be used to estimate the KL divergence between the inference and target distributions in both directions. We apply our inference evaluation techniques to show that twisted SMC is effective for sampling undesirable outputs from a pretrained model (a useful component of harmlessness training and automated red-teaming), generating reviews with varied sentiment, and performing infilling tasks.
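A minimal, deliberately simplified sketch of one twisted-SMC step: each particle is a partial sequence, its incremental weight combines the base model's log-probability for the newly appended token with a learned log-twist estimate of the expected future potential, and particles are multinomially resampled. The exact weight correction in the paper differs; this only shows the reweight-then-resample pattern.

```python
import numpy as np

def smc_step(particles, logp_step, log_twist, rng):
    """particles: list of partial sequences; logp_step[i]: base-LM log prob of
    particle i's new token; log_twist[i]: hypothetical learned estimate of the
    log expected future potential for the extended sequence."""
    logw = np.asarray(logp_step) + np.asarray(log_twist)
    w = np.exp(logw - logw.max())  # stabilized importance weights
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)  # resample
    return [particles[i] for i in idx]

rng = np.random.default_rng(0)
particles = smc_step([["Once"], ["The"]], [-1.2, -0.7], [-0.3, -0.9], rng)
```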
https://arxiv.org/abs/2404.17546
Brain-Computer Interfaces (BCIs) rely on accurately decoding electroencephalography (EEG) motor imagery (MI) signals for effective device control. Graph Neural Networks (GNNs) outperform Convolutional Neural Networks (CNNs) in this regard by leveraging the spatial relationships between EEG electrodes through adjacency matrices. The EEG_GLT-Net framework, featuring the state-of-the-art EEG_GLT adjacency matrix method, has notably enhanced EEG MI signal classification, evidenced by an average accuracy of 83.95% across 20 subjects on the PhysioNet dataset. This significantly exceeds the 76.10% accuracy achieved using the Pearson Correlation Coefficient (PCC) method within the same framework. In this research, we advance the field by applying a Reinforcement Learning (RL) approach to the classification of EEG MI signals. Our method empowers the RL agent, enabling not only the classification of EEG MI data points with higher accuracy, but also the effective identification of EEG MI data points that are less distinct. We present EEG_RL-Net, an enhancement of the EEG_GLT-Net framework, which incorporates the trained EEG GCN block from EEG_GLT-Net at an adjacency matrix density of 13.39% alongside an RL-centric Dueling Deep Q Network (Dueling DQN) block. The EEG_RL-Net model showcases exceptional classification performance, achieving an unprecedented average accuracy of 96.40% across 20 subjects within 25 milliseconds. This model illustrates the transformative effect of RL on EEG MI time-point classification.
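For reference, the Dueling DQN head named in the abstract decomposes Q(s, a) into a state value V(s) and mean-centred advantages A(s, a); a minimal sketch (the feature and action dimensions below are illustrative):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, in_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Linear(in_dim, 1)
        self.advantage = nn.Linear(in_dim, n_actions)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v, a = self.value(features), self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)

q_values = DuelingHead(in_dim=64, n_actions=4)(torch.randn(1, 64))
```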
https://arxiv.org/abs/2405.00723
In order to solve the problem of frequent deceleration of unmanned vehicles when approaching obstacles, this article uses a Deep Q-Network (DQN) and its extension, the Double Deep Q-Network (DDQN), to develop a local navigation system that adapts to obstacles while maintaining optimal speed planning. By integrating improved reward functions and obstacle angle determination methods, the system demonstrates significant enhancements in maneuvering capabilities without frequent decelerations. Experiments conducted in simulated environments with varying obstacle densities confirm the effectiveness of the proposed method in achieving more stable and efficient path planning.
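The DDQN extension mentioned here differs from vanilla DQN only in its bootstrap target: the online network selects the next action and the target network evaluates it, curbing overestimation. A minimal sketch (tensor shapes assumed batched):

```python
import torch

def ddqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    """reward, done: [B]; next_state: [B, obs_dim]; returns the [B] TD target."""
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)   # select
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)  # evaluate
        return reward + gamma * (1.0 - done.float()) * next_q
```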
https://arxiv.org/abs/2404.17379
Despite the success of large language models (LLMs) in natural language generation, much evidence shows that LLMs may produce incorrect or nonsensical text. This limitation highlights the importance of discerning when to trust LLMs, especially in safety-critical domains. Existing methods, which elicit verbalized confidence to indicate reliability by inducing top-k responses or by sampling and aggregating multiple responses, often fail due to the lack of objective guidance for confidence. To address this, we propose the CONfidence-Quality-ORDer-preserving alignment approach (CONQORD), leveraging reinforcement learning with a tailored dual-component reward function. This function combines a quality reward with an order-preserving alignment reward. Specifically, the order-preserving reward incentivizes the model to verbalize greater confidence for responses of higher quality, aligning the order of confidence with the order of quality. Experiments demonstrate that CONQORD significantly improves the alignment between confidence levels and response accuracy, without causing the model to become over-cautious. Furthermore, the aligned confidence provided by CONQORD informs when to trust LLMs and acts as a determinant for initiating the retrieval of external knowledge. Aligning confidence with response quality ensures more transparent and reliable responses, providing better trustworthiness.
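One way to picture the order-preserving component (a sketch of our reading, not the paper's code): over a batch of responses, reward the fraction of pairs whose confidence ordering agrees with their quality ordering, added to the quality reward with a weighting coefficient `lam` (our assumption).

```python
def order_preserving_reward(confidences: list[float], qualities: list[float]) -> float:
    """Fraction of response pairs whose confidence order matches quality order."""
    n = len(confidences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum(
        1 for i, j in pairs
        if (confidences[i] - confidences[j]) * (qualities[i] - qualities[j]) > 0
    )
    return agree / len(pairs) if pairs else 0.0

def total_reward(quality: float, confidences, qualities, lam: float = 0.5) -> float:
    return quality + lam * order_preserving_reward(confidences, qualities)
```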
https://arxiv.org/abs/2404.17287
Autonomous Unmanned Aerial Vehicles (UAVs) have become essential tools in defense, law enforcement, disaster response, and product delivery. These autonomous navigation systems require a wireless communication network and, of late, are deep learning based. In critical scenarios such as border protection or disaster response, ensuring the secure navigation of autonomous UAVs is paramount. However, these autonomous UAVs are susceptible to adversarial attacks through the communication network or the deep learning models: eavesdropping, man-in-the-middle, membership inference, and reconstruction attacks. To address this susceptibility, we propose an innovative approach that combines Reinforcement Learning (RL) and Fully Homomorphic Encryption (FHE) for secure autonomous UAV navigation. This end-to-end secure framework is designed for real-time video feeds captured by UAV cameras and utilizes FHE to perform inference on encrypted input images. While FHE allows computations on encrypted data, certain computational operators are yet to be implemented. Convolutional neural networks, fully connected neural networks, activation functions, and the OpenAI Gym library are meticulously adapted to the FHE domain to enable encrypted data processing. We demonstrate the efficacy of our proposed approach through extensive experimentation, showing that it ensures security and privacy in autonomous UAV navigation with negligible loss in performance.
https://arxiv.org/abs/2404.17225
Bipedal robots are garnering increasing global attention due to their potential applications and advancements in artificial intelligence, particularly in Deep Reinforcement Learning (DRL). While DRL has driven significant progress in bipedal locomotion, developing a comprehensive and unified framework capable of adeptly performing a wide range of tasks remains a challenge. This survey systematically categorizes, compares, and summarizes existing DRL frameworks for bipedal locomotion, organizing them into end-to-end and hierarchical control schemes. End-to-end frameworks are assessed based on their learning approaches, whereas hierarchical frameworks are dissected into layers that utilize either learning-based methods or traditional model-based approaches. This survey provides a detailed analysis of the composition, capabilities, strengths, and limitations of each framework type. Furthermore, we identify critical research gaps and propose future directions aimed at achieving a more integrated and efficient framework for bipedal locomotion, with potential broad applications in everyday life.
https://arxiv.org/abs/2404.17070
The objective of this work is to evaluate multi-agent artificial intelligence methods when deployed on teams of unmanned surface vehicles (USV) in an adversarial environment. Autonomous agents were evaluated in real-world scenarios using the Aquaticus test-bed, a Capture-the-Flag (CTF) style competition involving teams of USV systems. Cooperative teaming algorithms grounded in behavior-based optimization and deep reinforcement learning (RL) were deployed on these USV systems in two-versus-two teams and tested against each other during a competition period in the fall of 2023. Deep reinforcement learning applied to USV agents was achieved via the Pyquaticus test bed, a lightweight gymnasium environment that allows simulated CTF training in a low-level environment. The results of the experiment demonstrate that rule-based cooperation for behavior-based agents outperformed agents trained with deep reinforcement learning paradigms as implemented in these competitions. Further integration of the Pyquaticus gymnasium environment for RL with MOOS-IvP, in terms of configuration and control schema, will allow for more competitive CTF games in future studies. As the development of experimental deep RL methods continues, the authors expect the competitive gap between behavior-based autonomy and deep RL to narrow. As such, this report outlines the overall competition, methods, and results, with an emphasis on future work such as reward shaping, sim-to-real methodologies, and extending rule-based cooperation among agents to react to safety and security events in accordance with human experts' intent and rules for executing safety and security processes.
https://arxiv.org/abs/2404.17038