In the rapidly evolving field of artificial intelligence, ensuring safe decision-making by Large Language Models (LLMs) is a significant challenge. This paper introduces the Governance of the Commons Simulation (GovSim), a simulation platform designed to study strategic interactions and cooperative decision-making in LLMs. Through this simulation environment, we explore the dynamics of resource sharing among AI agents, highlighting the importance of ethical considerations, strategic planning, and negotiation skills. GovSim is versatile and supports any text-based agent, including LLM agents. Using the Generative Agent framework, we create a standard agent that facilitates the integration of different LLMs. Our findings reveal that within GovSim, only two of the 15 tested LLMs managed to achieve a sustainable outcome, indicating a significant gap in models' ability to manage shared resources. Furthermore, we find that when agents' ability to communicate is removed, they overuse the shared resource, highlighting the importance of communication for cooperation. Interestingly, most LLMs lack the ability to make universalized hypotheses, which highlights a significant weakness in their reasoning skills. We open-source the full suite of our research results, including the simulation environment, agent prompts, and a comprehensive web interface.
https://arxiv.org/abs/2404.16698
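Although the GovSim paper above studies LLM agents that negotiate their harvests in natural language, the underlying resource dynamics can be illustrated with a toy common-pool model. The parameters and logistic regrowth below are illustrative assumptions, not GovSim's actual mechanics:

```python
# Toy common-pool resource dynamic (hypothetical parameters): agents harvest
# from a shared pool that regrows logistically each step.
def run_commons(num_agents=5, pool=100.0, capacity=100.0,
                regrowth=0.25, harvest_per_agent=12.0, steps=20):
    """Return True if the resource survives all steps (a 'sustainable' run)."""
    for _ in range(steps):
        total_harvest = min(pool, num_agents * harvest_per_agent)
        pool -= total_harvest
        # Logistic regrowth toward the carrying capacity.
        pool = min(capacity, pool + regrowth * pool * (1 - pool / capacity))
        if pool <= 0:
            return False  # the commons collapsed
    return True

# Greedy harvesting collapses the pool; a much lower quota sustains it.
print(run_commons(harvest_per_agent=12.0))  # False
print(run_commons(harvest_per_agent=1.0))   # True
```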
While Poker, as a family of games, has been studied extensively in the last decades, collectible card games have seen relatively little attention. Only recently have we seen an agent that can compete with professional human players in Hearthstone, one of the most popular collectible card games. Although artificial agents must be able to work with imperfect information in both of these genres, collectible card games pose another set of distinct challenges. Unlike in many poker variants, agents must deal with a state space so vast that even enumerating all states consistent with the agent's beliefs is intractable, rendering current search methods unusable and requiring the agents to opt for other techniques. In this paper, we investigate the strength of such techniques for this class of games. Namely, we present preliminary analysis results of ByteRL, the state-of-the-art agent in Legends of Code and Magic and Hearthstone. Although ByteRL beat a top-10 Hearthstone player from China, we show that its play in Legends of Code and Magic is highly exploitable.
https://arxiv.org/abs/2404.16689
Autonomous navigation in dynamic environments is a complex but essential task for autonomous robots, with recent deep reinforcement learning approaches showing promising results. However, the complexity of the real world makes it infeasible to train agents in every possible scenario configuration. Moreover, existing methods typically overlook factors such as robot kinodynamic constraints, or assume perfect knowledge of the environment. In this work, we present RUMOR, a novel planner for differential-drive robots that uses deep reinforcement learning to navigate in highly dynamic environments. Unlike other end-to-end DRL planners, it uses a descriptive robocentric velocity space model to extract the dynamic environment information, enhancing training effectiveness and scenario interpretation. Additionally, we propose an action space that inherently considers robot kinodynamics, and we train in a simulator that reproduces problematic aspects of the real world, reducing the gap between reality and simulation. We extensively compare RUMOR with other state-of-the-art approaches, demonstrating better performance, and provide a detailed analysis of the results. Finally, we validate RUMOR's performance in real-world settings by deploying it on a ground robot. Our experiments, conducted in crowded scenarios and unseen environments, confirm the algorithm's robustness and transferability.
https://arxiv.org/abs/2404.16672
Developing autonomous agents for mobile devices can significantly enhance user interactions by offering increased efficiency and accessibility. However, despite the growing interest in mobile device control agents, the absence of a commonly adopted benchmark makes it challenging to quantify scientific progress in this area. In this work, we introduce B-MoCA: a novel benchmark designed specifically for evaluating mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 60 common daily tasks. Importantly, we incorporate a randomization feature that changes various aspects of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained from scratch using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to enhance their effectiveness. Our source code is publicly available at this https URL.
https://arxiv.org/abs/2404.16660
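The randomization feature is B-MoCA's key design point. A hypothetical sketch of what sampling a randomized device configuration could look like; the fields and value pools below are assumptions for illustration, not the benchmark's actual API:

```python
import random

UI_THEMES = ["light", "dark"]
LANGUAGES = ["en-US", "ko-KR", "de-DE", "ja-JP"]
ICON_GRIDS = [(4, 5), (5, 5), (5, 6)]
FONT_SCALES = [0.85, 1.0, 1.15, 1.3]

def sample_device_config(seed=None):
    """Draw one randomized device configuration for a benchmark episode."""
    rng = random.Random(seed)
    return {
        "theme": rng.choice(UI_THEMES),
        "language": rng.choice(LANGUAGES),
        "icon_grid": rng.choice(ICON_GRIDS),
        "font_scale": rng.choice(FONT_SCALES),
        "wallpaper_id": rng.randrange(10),
    }

# An agent should complete the same task, e.g. "turn on airplane mode",
# under many such configurations to demonstrate generalization.
print(sample_device_config(seed=0))
```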
Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of errors. In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE). This framework assesses the interactive motion of agents by employing neural interaction energy, which captures the dynamics of interactions and illustrates their influence on the future trajectories of agents. To bolster temporal stability, we introduce two constraints: an inter-agent interaction constraint and an intra-agent motion constraint. These constraints work together to ensure temporal stability at both the system and agent levels, effectively mitigating prediction fluctuations inherent in multi-agent systems. Comparative evaluations against previous methods on four diverse datasets highlight the superior prediction accuracy and generalization capabilities of our model.
https://arxiv.org/abs/2404.16579
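The abstract names the two stability constraints but not their form. Below is a hedged numpy sketch of how a trajectory loss could combine them; the finite-difference penalties are simplifying assumptions, not the paper's neural interaction energy:

```python
import numpy as np

def stability_loss(pred, target, w_inter=0.1, w_intra=0.1):
    """pred, target: (num_agents, num_steps, 2) xy trajectories."""
    prediction_err = np.mean((pred - target) ** 2)
    vel = np.diff(pred, axis=1)                        # per-agent velocities
    acc = np.diff(vel, axis=1)
    # Intra-agent motion constraint: penalize jerky kinematic states.
    intra = np.mean(acc ** 2)
    # Inter-agent interaction constraint (system level): keep pairwise
    # distances from fluctuating abruptly between consecutive steps.
    diffs = pred[:, None, :, :] - pred[None, :, :, :]  # (N, N, T, 2)
    dists = np.linalg.norm(diffs, axis=-1)             # (N, N, T)
    inter = np.mean(np.diff(dists, axis=-1) ** 2)
    return prediction_err + w_inter * inter + w_intra * intra

pred = np.random.rand(3, 8, 2)
target = np.random.rand(3, 8, 2)
print(stability_loss(pred, target))
```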
A central question for cognitive science is to understand how humans process visual objects, i.e., to uncover the human low-dimensional concept representation space from high-dimensional visual stimuli. Generating visual stimuli with controllable concepts is the key. However, there are currently no generative models in AI to solve this problem. Here, we present the Concept based Controllable Generation (CoCoG) framework. CoCoG consists of two components: a simple yet efficient AI agent for extracting interpretable concepts and predicting human decision-making in visual similarity judgment tasks, and a conditional generation model for generating visual stimuli given the concepts. We quantify the performance of CoCoG from two aspects: human behavior prediction accuracy and controllable generation ability. Experiments with CoCoG indicate that 1) the reliable concept embeddings in CoCoG allow human behavior to be predicted with 64.07% accuracy on the THINGS-similarity dataset; 2) CoCoG can generate diverse objects through the control of concepts; and 3) CoCoG can manipulate human similarity judgment behavior by intervening on key concepts. CoCoG offers visual objects with controllable concepts to advance our understanding of causality in human cognition. The code of CoCoG is available at \url{this https URL}.
https://arxiv.org/abs/2404.16482
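The first CoCoG component predicts human similarity judgments from low-dimensional concept embeddings. A simplified sketch of an odd-one-out decision rule of the kind used with THINGS-style triplets; the embedding values and the softmax rule are illustrative assumptions, since the paper learns its embeddings from data:

```python
import numpy as np

def odd_one_out_probs(emb):
    """emb: (3, d) concept embeddings of the three presented objects.
    Returns P(object i is the odd one out), following the common rule that
    the pair with the highest dot-product similarity is kept together."""
    sim = emb @ emb.T
    # Similarity of the pair that excludes object i:
    pair_sim = np.array([sim[1, 2], sim[0, 2], sim[0, 1]])
    expd = np.exp(pair_sim - pair_sim.max())
    return expd / expd.sum()

emb = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
print(odd_one_out_probs(emb))  # object 2 is most likely the odd one out
```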
Multi-agent pathfinding (MAPF) is the problem of finding a set of conflict-free paths for a set of agents. Typically, the agents' moves are limited to a pre-defined graph of possible locations and allowed transitions between them, e.g. a 4-neighborhood grid. We explore how to solve MAPF problems when each agent can move between any pair of possible locations as long as traversing the line segment connecting them does not lead to a collision with the obstacles. This is known as any-angle pathfinding. We present the first optimal any-angle multi-agent pathfinding algorithm. Our planner is based on the Continuous Conflict-based Search (CCBS) algorithm and an optimal any-angle variant of Safe Interval Path Planning (TO-AA-SIPP). The straightforward combination of the two, however, scales poorly, since any-angle pathfinding induces search trees with a very large branching factor. To mitigate this, we adapt two techniques from classical MAPF to the any-angle setting, namely Disjoint Splitting and Multi-Constraints. Experimental results on different combinations of these techniques show that they enable solving over 30% more problems than the vanilla combination of CCBS and TO-AA-SIPP. In addition, we present a bounded-suboptimal variant of our algorithm that enables trading runtime for solution cost in a controlled manner.
https://arxiv.org/abs/2404.16379
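The primitive that distinguishes any-angle pathfinding from grid pathfinding is the line-of-sight test: a move between two cells is legal only if the connecting segment crosses no blocked cell. A minimal sketch follows; the dense sampling is a simplification of the exact grid-traversal checks planners such as TO-AA-SIPP rely on:

```python
def line_of_sight(grid, a, b, samples_per_cell=4):
    """grid[r][c] is True when the cell is blocked; a, b are (row, col).
    Samples points densely along the segment and checks the cell each
    sample falls in (an exact traversal would walk cell boundaries)."""
    (r0, c0), (r1, c1) = a, b
    n = max(abs(r1 - r0), abs(c1 - c0)) * samples_per_cell + 1
    for i in range(n + 1):
        t = i / n
        r = round(r0 + t * (r1 - r0))
        c = round(c0 + t * (c1 - c0))
        if grid[r][c]:
            return False
    return True

grid = [[False, False, False],
        [False, True,  False],   # blocked cell in the middle
        [False, False, False]]
print(line_of_sight(grid, (0, 0), (2, 2)))  # False: the diagonal clips (1,1)
print(line_of_sight(grid, (0, 0), (0, 2)))  # True: the top row is free
```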
Foundation models contain a wealth of information from their vast number of training samples. However, most prior art fails to extract this information in a precise and efficient way for small sample sizes. In this work, we propose a framework that uses reinforcement learning as a control for foundation models, allowing the granular generation of small, focused synthetic support sets to augment the performance of neural network models on real data classification tasks. We first give a reinforcement learning agent access to a novel context-based dictionary; the agent then uses this dictionary with a novel prompt structure to form and optimize prompts as inputs to generative models, receiving feedback based on a reward function that combines the change in validation accuracy and entropy. A support set is formed this way over several exploration steps. Our framework produces excellent results, increasing classification accuracy by significant margins at no additional labelling or data cost.
https://arxiv.org/abs/2404.16300
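The exploration loop above can be made concrete with toy stand-ins for the generative model and the evaluator. Everything below (dictionary contents, reward weighting, greedy acceptance rule) is an illustrative assumption rather than the paper's implementation:

```python
import math, random

DICTIONARY = ["close-up", "outdoor", "occluded", "low-light", "cluttered"]

def generate_support_set(prompt):          # stand-in for a generative model
    return [f"synthetic image for: {prompt}"]

def validation_accuracy(support_set):      # stand-in for retraining + eval
    return min(1.0, 0.60 + 0.02 * len(support_set) + random.random() * 0.01)

def prediction_entropy(support_set):       # stand-in for model uncertainty
    p = min(0.99, 0.5 + 0.05 * len(support_set))
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

support, prev_acc, prev_ent = [], validation_accuracy([]), prediction_entropy([])
for step in range(5):
    prompt = ", ".join(random.sample(DICTIONARY, k=2))
    candidate = support + generate_support_set(prompt)
    acc, ent = validation_accuracy(candidate), prediction_entropy(candidate)
    reward = (acc - prev_acc) + 0.1 * (prev_ent - ent)   # assumed weighting
    if reward > 0:                                       # greedy acceptance
        support, prev_acc, prev_ent = candidate, acc, ent
print(f"{len(support)} synthetic examples kept, final accuracy {prev_acc:.3f}")
```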
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consuming collection of large quantities of acoustic data at dense spatial locations in the space, or rely on privileged knowledge of scene geometry to intelligently select acoustic data sampling locations. We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment, in which a mobile agent equipped with visual and acoustic sensors jointly constructs the environment acoustic model and the occupancy map on the fly. We introduce ActiveRIR, a reinforcement learning (RL) policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions, yielding a high-quality acoustic model of the environment from a minimal set of acoustic samples. We train our policy with a novel RL reward based on information gain in the environment acoustic model. Evaluating on diverse unseen indoor environments from a state-of-the-art acoustic simulation platform, ActiveRIR outperforms an array of methods, from traditional navigation agents based on spatial novelty and visual exploration to existing state-of-the-art methods.
https://arxiv.org/abs/2404.16216
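A minimal sketch of a reward driven by information gain in the acoustic model, as the abstract describes: the agent is rewarded when a new sample reduces the model's held-out prediction error. The scalar field and nearest-neighbour predictor below are toy stand-ins for RIR prediction, and a real agent would have to navigate to each position rather than sample freely:

```python
import numpy as np

def field(x, y):                     # toy ground-truth acoustic quantity
    return np.sin(3 * x) * np.cos(2 * y)

def model_error(samples, grid):
    """Nearest-neighbour prediction error over a held-out grid (toy model)."""
    pts = np.array([s[0] for s in samples])
    vals = np.array([s[1] for s in samples])
    err = 0.0
    for gx, gy in grid:
        nearest = np.argmin(np.hypot(pts[:, 0] - gx, pts[:, 1] - gy))
        err += (vals[nearest] - field(gx, gy)) ** 2
    return err / len(grid)

rng = np.random.default_rng(0)
grid = [(x, y) for x in np.linspace(0, 1, 5) for y in np.linspace(0, 1, 5)]
samples = [((0.5, 0.5), field(0.5, 0.5))]
for _ in range(10):
    pos = tuple(rng.uniform(0, 1, size=2))           # candidate location
    before = model_error(samples, grid)
    after = model_error(samples + [(pos, field(*pos))], grid)
    reward = before - after                          # information-gain proxy
    if reward > 0:
        samples.append((pos, field(*pos)))           # sample worth taking
print(len(samples), "samples kept")
```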
Consider an agent acting to achieve its temporal goal, but with a "trembling hand". In this case, the agent may mistakenly instruct, with a certain (typically small) probability, actions that are not intended due to faults or imprecision in its action selection mechanism, thereby leading to possible goal failure. We study the trembling-hand problem in the context of reasoning about actions and planning for temporally extended goals expressed in Linear Temporal Logic on finite traces (LTLf), where we want to synthesize a strategy (aka plan) that maximizes the probability of satisfying the LTLf goal in spite of the trembling hand. We consider both deterministic and nondeterministic (adversarial) domains. We propose solution techniques for both cases by relying respectively on Markov Decision Processes and on Markov Decision Processes with Set-valued Transitions with LTLf objectives, where the set-valued probabilistic transitions capture both the nondeterminism from the environment and the possible action instruction errors from the agent. We formally show the correctness of our solution techniques and demonstrate their effectiveness experimentally through a proof-of-concept implementation.
https://arxiv.org/abs/2404.16163
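On a tabular MDP, the trembling-hand perturbation has a simple closed form: with some probability the instructed action is replaced by another one, so the effective kernel mixes the nominal ones. A sketch assuming a uniform error model (the paper's set-valued transitions generalize this picture):

```python
import numpy as np

def trembling_kernel(P, eps):
    """P: (A, S, S) nominal transition kernel. Returns the perturbed kernel
    P'(s'|s,a) = (1-eps) P(s'|s,a) + eps/(A-1) * sum_{b != a} P(s'|s,b),
    i.e. with probability eps a uniformly random other action executes."""
    A = P.shape[0]
    total = P.sum(axis=0, keepdims=True)            # sum over all actions
    others = (total - P) / (A - 1)                  # average over b != a
    return (1 - eps) * P + eps * others

P = np.zeros((2, 2, 2))
P[0] = [[1, 0], [1, 0]]   # action 0: always go to state 0
P[1] = [[0, 1], [0, 1]]   # action 1: always go to state 1
Pt = trembling_kernel(P, eps=0.1)
print(Pt[0, 0])  # [0.9, 0.1]: intending action 0 now fails 10% of the time
```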
Multi-Agent Path Finding (MAPF) is the problem of moving multiple agents from starts to goals without collisions. Lifelong MAPF (LMAPF) extends MAPF by continuously assigning new goals to agents. We present our winning approach to the 2023 League of Robot Runners LMAPF competition, which leads us to several interesting research challenges and future directions. In this paper, we outline three main research challenges. The first challenge is to search for high-quality LMAPF solutions within a limited planning time (e.g., 1s per step) for a large number of agents (e.g., 10,000) or extremely high agent density (e.g., 97.7%). We present future directions such as developing more competitive rule-based and anytime MAPF algorithms and parallelizing state-of-the-art MAPF algorithms. The second challenge is to alleviate congestion and the effect of myopic behaviors in LMAPF algorithms. We present future directions, such as developing moving guidance and traffic rules to reduce congestion, incorporating future prediction and real-time search, and determining the optimal agent number. The third challenge is to bridge the gaps between the LMAPF models used in the literature and real-world applications. We present future directions, such as dealing with more realistic kinodynamic models, execution uncertainty, and evolving systems.
https://arxiv.org/abs/2404.16162
The advent of personalized content generation by LLMs presents a novel challenge: how to efficiently adapt text to meet individual preferences without the unsustainable demand of creating a unique model for each user. This study introduces an innovative online method that employs neural bandit algorithms to dynamically optimize soft instruction embeddings based on user feedback, enhancing the personalization of open-ended text generation by white-box LLMs. Through rigorous experimentation on various tasks, we demonstrate significant performance improvements over baseline strategies. NeuralTS, in particular, leads to substantial enhancements in personalized news headline generation, achieving up to a 62.9% improvement in best ROUGE scores and up to a 2.76% increase in LLM-agent evaluation over the baseline.
https://arxiv.org/abs/2404.16115
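A heavily simplified stand-in for the method above: Thompson sampling with a linear reward model over candidate soft-instruction embeddings, with user feedback as the bandit reward. NeuralTS replaces the linear posterior with a neural network, and the candidate set and feedback model below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, sigma = 8, 20, 0.1
arms = rng.normal(size=(n_arms, d))        # candidate soft-prompt embeddings
true_w = rng.normal(size=d)                # hidden user preference

A = np.eye(d)                              # posterior precision
b = np.zeros(d)
for t in range(200):
    mean = np.linalg.solve(A, b)
    w_sample = rng.multivariate_normal(mean, np.linalg.inv(A))  # TS draw
    arm = int(np.argmax(arms @ w_sample))
    reward = arms[arm] @ true_w + sigma * rng.normal()  # user feedback proxy
    A += np.outer(arms[arm], arms[arm])                 # Bayesian update
    b += reward * arms[arm]

best = int(np.argmax(arms @ true_w))
print("chosen:", arm, "best:", best)   # typically converges to the best arm
```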
This study explores the use of Large Language Models (LLMs) for automatic evaluation of knowledge graph (KG) completion models. Historically, validating information in KGs has been a challenging task, requiring large-scale human annotation at prohibitive cost. With the emergence of general-purpose generative AI and LLMs, it is now plausible that human-in-the-loop validation could be replaced by a generative agent. We introduce a framework for consistency and validation when using generative models to validate knowledge graphs. Our framework is based upon recent open-source developments for structural and semantic validation of LLM outputs, and upon flexible approaches to fact checking and verification, supported by the capacity to reference external knowledge sources of any kind. The design is easy to adapt and extend, and can be used to verify any kind of graph-structured data through a combination of model-intrinsic knowledge, user-supplied context, and agents capable of external knowledge retrieval.
https://arxiv.org/abs/2404.15923
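Before any semantic fact checking, structural validation requires the generative validator's output to parse into an expected schema. A minimal sketch of that step; the triple format and verdict vocabulary are assumptions, not the framework's actual interface:

```python
import json

REQUIRED = {"subject": str, "predicate": str, "object": str,
            "verdict": str, "evidence": str}

def validate_output(raw):
    """Return the parsed verdict record, or None if structurally invalid."""
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for key, typ in REQUIRED.items():
        if not isinstance(rec.get(key), typ):
            return None
    if rec["verdict"] not in {"supported", "refuted", "not_enough_info"}:
        return None
    return rec

raw = ('{"subject": "Marie Curie", "predicate": "bornIn", "object": "Warsaw",'
       ' "verdict": "supported", "evidence": "Her birthplace was Warsaw."}')
print(validate_output(raw)["verdict"])   # supported
```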
Reinforcement learning is a popular method of finding optimal solutions to complex problems. Algorithms like Q-learning excel at learning to solve stochastic problems without a model of their environment. However, they take longer to solve deterministic problems than is necessary. Q-learning can be improved to better solve deterministic problems by introducing such a model-based approach. This paper introduces the recursive backwards Q-learning (RBQL) agent, which explores and builds a model of the environment. After reaching a terminal state, it recursively propagates its value backwards through this model. This lets each state be evaluated to its optimal value without a lengthy learning process. In the example of finding the shortest path through a maze, this agent greatly outperforms a regular Q-learning agent.
https://arxiv.org/abs/2404.15822
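The core idea is concrete enough to sketch: record the deterministic model (s, a) -> (s', r) while exploring, and when a terminal state is reached, recursively sweep values backwards through every recorded predecessor. A 1-D corridor stands in for the paper's mazes; the exploration policy and reward values are illustrative assumptions:

```python
from collections import defaultdict
import random

GOAL, START, GAMMA = 6, 0, 0.95
model, preds, V = {}, defaultdict(set), defaultdict(float)

def step(s, a):                       # deterministic toy environment
    s2 = max(0, min(GOAL, s + a))
    return s2, (1.0 if s2 == GOAL else -0.01)

def backup(s):                        # recursive backwards value propagation
    for (sp, a) in preds[s]:
        target = model[(sp, a)][1] + GAMMA * V[s]
        if target > V[sp]:
            V[sp] = target
            backup(sp)

for episode in range(20):
    s = START
    while s != GOAL:
        a = random.choice([-1, 1])    # pure exploration
        s2, r = step(s, a)
        model[(s, a)] = (s2, r)       # learned deterministic model
        preds[s2].add((s, a))
        s = s2
    backup(GOAL)                      # one recursive sweep per episode

print([round(V[s], 3) for s in range(GOAL + 1)])  # values rise toward the goal
```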
Cooperative Adaptive Cruise Control (CACC) represents a quintessential control strategy for orchestrating vehicular platoon movement within Connected and Automated Vehicle (CAV) systems, significantly enhancing traffic efficiency and reducing energy consumption. In recent years, data-driven methods such as reinforcement learning (RL) have been employed to address this task due to their significant advantages in terms of efficiency and flexibility. However, the delay issue, which often arises in real-world CACC systems, is rarely taken into account by current RL-based approaches. To tackle this problem, we propose a Delay-Aware Multi-Agent Reinforcement Learning (DAMARL) framework aimed at achieving safe and stable control for CACC. We model the entire decision-making process using a Multi-Agent Delay-Aware Markov Decision Process (MADA-MDP) and develop a centralized training with decentralized execution (CTDE) MARL framework for distributed control of CACC platoons. An attention mechanism-integrated policy network is introduced to enhance the performance of CAV communication and decision-making. Additionally, a velocity optimization model-based action filter is incorporated to further ensure the stability of the platoon. Experimental results across various delay conditions and platoon sizes demonstrate that our approach consistently outperforms baseline methods in terms of platoon safety, stability, and overall performance.
https://arxiv.org/abs/2404.15696
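The delay issue can be made concrete with a small observation buffer: each agent acts on state information that is k steps stale. The fixed delay and the leader/follower example below are illustrative assumptions, not the DAMARL architecture:

```python
from collections import deque

class DelayedObservation:
    def __init__(self, delay, initial_obs):
        self.buf = deque([initial_obs] * (delay + 1), maxlen=delay + 1)

    def push(self, obs):
        """Store the fresh observation, return the one from `delay` steps ago."""
        self.buf.append(obs)
        return self.buf[0]

follower_view = DelayedObservation(delay=2, initial_obs=0.0)
leader_speed = [20.0, 21.0, 22.0, 23.0, 24.0]
for t, v in enumerate(leader_speed):
    observed = follower_view.push(v)
    print(f"t={t}: true leader speed {v}, follower sees {observed}")
```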
Affordances, a concept rooted in ecological psychology and pioneered by James J. Gibson, have emerged as a fundamental framework for understanding the dynamic relationship between individuals and their environments. Expanding beyond traditional perceptual and cognitive paradigms, affordances represent the inherent effect and action possibilities that objects offer to the agents within a given context. As a theoretical lens, affordances bridge the gap between effect and action, providing a nuanced understanding of the connections between agents' actions on entities and the effects of these actions. In this study, we propose a model that unifies object, action, and effect into a single latent representation in a common latent space shared by all affordances, which we call the affordance space. Using this affordance space, our system can generate effect trajectories when an action and an object are given, and can generate action trajectories when effect trajectories and objects are given. In the experiments, we show that our model does not learn the behavior of each object but rather the affordance relations shared by objects, which we call equivalences. In addition to simulated experiments, we show that our model can be used for direct imitation in real-world cases. We also propose affordances as a basis for cross-embodiment transfer to link the actions of different robots. Finally, we introduce selective loss as a solution that allows valid outputs to be generated for indeterministic model inputs.
https://arxiv.org/abs/2404.15648
Reinforcement learning (RL) with continuous state and action spaces remains one of the most challenging problems within the field. Most current learning methods focus on integral identities such as value functions to derive an optimal strategy for the learning agent. In this paper, we instead study the dual form of the original RL formulation and propose the first differential RL framework that can handle settings with limited training samples and short-length episodes. Our approach introduces Differential Policy Optimization (DPO), a pointwise and stage-wise iteration method that optimizes policies encoded by local-movement operators. We prove a pointwise convergence estimate for DPO and provide a regret bound comparable with current theoretical works. Such a pointwise estimate ensures that the learned policy matches the optimal path uniformly across different steps. We then apply DPO to a class of practical RL problems that search for optimal configurations with Lagrangian rewards. DPO is easy to implement, scalable, and shows competitive results in benchmarking experiments against several popular RL methods.
https://arxiv.org/abs/2404.15617
This paper proposes a decentralized trajectory planning framework for the collision avoidance problem of multiple micro aerial vehicles (MAVs) in environments with static and dynamic obstacles. The framework utilizes spatiotemporal occupancy grid maps (SOGM), which forecast the occupancy status of neighboring space in the near future, as the environment representation. Based on this representation, we extend the kinodynamic A* and the corridor-constrained trajectory optimization algorithms to efficiently tackle static and dynamic obstacles with arbitrary shapes. Collision avoidance between communicating robots is integrated by sharing planned trajectories and projecting them onto the SOGM. The simulation results show that our method achieves competitive performance against state-of-the-art methods in dynamic environments with different numbers and shapes of obstacles. Finally, the proposed method is validated in real experiments.
https://arxiv.org/abs/2404.15602
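A sketch of how a candidate trajectory can be checked against a spatiotemporal occupancy grid map: one predicted occupancy grid per future time step, indexed at the position the robot would occupy at that step. The resolution and grid sizes below are illustrative assumptions:

```python
import numpy as np

RES, ORIGIN = 0.5, np.array([0.0, 0.0])     # 0.5 m cells

def trajectory_is_free(sogm, traj):
    """sogm: (T, H, W) boolean, True = predicted occupied.
    traj: (T, 2) planned xy positions, one per future time step."""
    for t, (x, y) in enumerate(traj):
        i, j = ((np.array([x, y]) - ORIGIN) / RES).astype(int)
        if sogm[t, i, j]:
            return False                    # collides with predicted occupancy
    return True

sogm = np.zeros((3, 8, 8), dtype=bool)
sogm[2, 4, 4] = True                        # an obstacle predicted at t=2
good = np.array([[0.4, 0.4], [1.2, 1.2], [1.6, 1.6]])
bad = np.array([[0.4, 0.4], [1.2, 1.2], [2.2, 2.2]])
print(trajectory_is_free(sogm, good), trajectory_is_free(sogm, bad))  # True False
```

Shared plans from neighboring robots would be projected into the same grids before this check, which is how inter-robot collision avoidance is integrated.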
Spiking neural networks (SNNs) are widely applied in various fields due to their energy efficiency and fast inference. Applying SNNs to reinforcement learning (RL) can significantly reduce the computational resource requirements for agents and improve the algorithm's performance under resource-constrained conditions. However, in current spiking reinforcement learning (SRL) algorithms, the simulation results of multiple time steps can only correspond to a single-step decision in RL. This is quite different from the real temporal dynamics in the brain and also fails to fully exploit the capacity of SNNs to process temporal data. In order to address this temporal mismatch issue and further take advantage of the inherent temporal dynamics of spiking neurons, we propose a novel temporal alignment paradigm (TAP) that leverages the single-step update of spiking neurons to accumulate historical state information in RL and introduces gated units to enhance the memory capacity of spiking neurons. Experimental results show that our method can solve partially observable Markov decision processes (POMDPs) and multi-agent cooperation problems with performance similar to recurrent neural networks (RNNs) but with about 50% of the power consumption.
https://arxiv.org/abs/2404.15597
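One plausible reading of the gated single-step update is a leaky integrate-and-fire neuron whose decay is controlled by a learned gate, so each RL step retains a controllable amount of history. The exact gating used by TAP may differ; this is an illustrative form:

```python
import numpy as np

def gated_lif_step(v, x, w_in, w_gate, threshold=1.0):
    """One simulation step: v is the membrane potential carried across RL
    steps, x the input; returns (new_v, spikes)."""
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ x)))   # sigmoid forget gate
    v = gate * v + w_in @ x                      # gated accumulation of history
    spikes = (v >= threshold).astype(float)
    v = v * (1.0 - spikes)                       # hard reset after a spike
    return v, spikes

rng = np.random.default_rng(0)
n_in, n_neurons = 4, 3
w_in = rng.normal(scale=0.5, size=(n_neurons, n_in))
w_gate = rng.normal(scale=0.5, size=(n_neurons, n_in))
v = np.zeros(n_neurons)
for t in range(5):
    v, s = gated_lif_step(v, rng.normal(size=n_in), w_in, w_gate)
    print(f"step {t}: spikes = {s}")
```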
The rapidly changing architecture and functionality of electrical networks and the increasing penetration of renewable and distributed energy resources have resulted in various technological and managerial challenges. These have rendered traditional centralized energy-market paradigms insufficient due to their inability to support the dynamic and evolving nature of the network. This survey explores how multi-agent reinforcement learning (MARL) can support the decentralization and decarbonization of energy networks and mitigate the associated challenges. This is achieved by specifying key computational challenges in managing energy networks, reviewing recent research progress on addressing them, and highlighting open challenges that may be addressed using MARL.
https://arxiv.org/abs/2404.15583