Autonomous agents powered by large vision and language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, it remains unclear what types of risks and attacks exist around them. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing the tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.
https://arxiv.org/abs/2411.02391
Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
https://arxiv.org/abs/2411.02337
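WebRL's self-evolving curriculum, which generates new training tasks from unsuccessful attempts, might be sketched as the toy loop below. The task-mutation rule, the outcome-supervised reward model (ORM) stand-in, and all names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a self-evolving curriculum: tasks the agent fails become
# seeds for newly generated variants; an ORM-like scorer judges each rollout.
import random

def mutate_task(task: str, rng: random.Random) -> str:
    """Stand-in for LLM-based task generation from a failed attempt."""
    return f"{task} (variant {rng.randint(0, 999)})"

def curriculum_step(tasks, attempt, orm_score, rng, threshold=0.5):
    """One curriculum iteration: drop solved tasks, grow new ones from failures."""
    failures = [t for t in tasks if orm_score(attempt(t)) < threshold]
    new_tasks = [mutate_task(t, rng) for t in failures]
    return failures + new_tasks  # next round: retried failures + fresh variants

rng = random.Random(0)
tasks = ["book a flight", "find cheapest laptop"]
# toy agent that only solves the first task; toy ORM scoring 1.0 on success
attempt = lambda t: t == "book a flight"
orm = lambda success: 1.0 if success else 0.0
next_round = curriculum_step(tasks, attempt, orm, rng)
print(next_round)  # solved task dropped; failed task kept plus one new variant
```

The point of the sketch is the feedback loop: the task pool concentrates on what the policy cannot yet do, which is how the curriculum "self-evolves".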
In this report, we argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future. That means that the prospect of AI welfare and moral patienthood, i.e. of AI systems with their own interests and moral significance, is no longer an issue only for sci-fi or the distant future. It is an issue for the near future, and AI companies and other actors have a responsibility to start taking it seriously. We also recommend three early steps that AI companies and other actors can take: They can (1) acknowledge that AI welfare is an important and difficult issue (and ensure that language model outputs do the same), (2) start assessing AI systems for evidence of consciousness and robust agency, and (3) prepare policies and procedures for treating AI systems with an appropriate level of moral concern. To be clear, our argument in this report is not that AI systems definitely are, or will be, conscious, robustly agentic, or otherwise morally significant. Instead, our argument is that there is substantial uncertainty about these possibilities, and so we need to improve our understanding of AI welfare and our ability to make wise decisions about this issue. Otherwise there is a significant risk that we will mishandle decisions about AI welfare, mistakenly harming AI systems that matter morally and/or mistakenly caring for AI systems that do not.
https://arxiv.org/abs/2411.00986
Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.
https://arxiv.org/abs/2411.02305
Intelligent agents designed for interactive environments face significant challenges in text-based games, a domain that demands complex reasoning and adaptability. While agents based on large language models (LLMs) using self-reflection have shown promise, they struggle when initially successful and exhibit reduced effectiveness when using smaller LLMs. We introduce Sweet&Sour, a novel approach that addresses these limitations in existing reflection methods by incorporating positive experiences and managed memory to enrich the context available to the agent at decision time. Our comprehensive analysis spans both closed- and open-source LLMs and demonstrates the effectiveness of Sweet&Sour in improving agent performance, particularly in scenarios where previous approaches fall short.
https://arxiv.org/abs/2411.02223
Mobile agents are essential for automating tasks in complex and dynamic mobile environments. As foundation models evolve, the demands for agents that can adapt in real-time and process multimodal data have grown. This survey provides a comprehensive review of mobile agent technologies, focusing on recent advancements that enhance real-time adaptability and multimodal interaction. Recent evaluation benchmarks have been developed to better capture the static and interactive environments of mobile tasks, offering more accurate assessments of agents' performance. We then categorize these advancements into two main approaches: prompt-based methods, which utilize large language models (LLMs) for instruction-based task execution, and training-based methods, which fine-tune multimodal models for mobile-specific applications. Additionally, we explore complementary technologies that augment agent performance. By discussing key challenges and outlining future research directions, this survey offers valuable insights for advancing mobile agent technologies. A comprehensive resource list is available at this https URL.
https://arxiv.org/abs/2411.02006
Learning a precise robotic grasping policy is crucial for embodied agents operating in complex real-world manipulation tasks. Despite significant advancements, most models still struggle with accurate spatial positioning of objects to be grasped. We first show that this spatial generalization challenge stems primarily from the extensive data requirements for adequate spatial understanding. However, collecting such data with real robots is prohibitively expensive, and relying on simulation data often leads to visual generalization gaps upon deployment. To overcome these challenges, we then focus on state-based policy generalization and present \textbf{ManiBox}, a novel bounding-box-guided manipulation method built on a simulation-based teacher-student framework. The teacher policy efficiently generates scalable simulation data using bounding boxes, which are proven to uniquely determine the objects' spatial positions. The student policy then utilizes these low-dimensional spatial states to enable zero-shot transfer to real robots. Through comprehensive evaluations in simulated and real-world environments, ManiBox demonstrates a marked improvement in spatial grasping generalization and adaptability to diverse objects and backgrounds. Further, our empirical study into scaling laws for policy performance indicates that spatial volume generalization scales positively with data volume. For a certain level of spatial volume, the success rate of grasping empirically follows Michaelis-Menten kinetics relative to data volume, showing a saturation effect as data increases. Our videos and code are available in this https URL.
https://arxiv.org/abs/2411.01850
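The saturation effect ManiBox reports follows Michaelis-Menten kinetics. As a worked illustration of that relationship (the constants below are made up for demonstration, not the paper's fitted values), grasp success rate s as a function of data volume d is s(d) = s_max · d / (K + d), where K is the data volume at which half of s_max is reached.

```python
# Michaelis-Menten curve for success rate vs. data volume (illustrative constants).
def success_rate(d: float, s_max: float = 0.9, K: float = 1000.0) -> float:
    """Saturating success rate: linear in d for small d, approaches s_max for large d."""
    return s_max * d / (K + d)

print(round(success_rate(1_000), 3))    # 0.45  -> half of s_max exactly at d = K
print(round(success_rate(10_000), 3))   # 0.818 -> diminishing returns set in
print(round(success_rate(100_000), 3))  # 0.891 -> saturating toward s_max = 0.9
```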
We introduce Constrained Human-AI Cooperation (CHAIC), an inclusive embodied social intelligence challenge designed to test social perception and cooperation in embodied agents. In CHAIC, the goal is for an embodied agent equipped with egocentric observations to assist a human who may be operating under physical constraints -- e.g., unable to reach high places or confined to a wheelchair -- in performing common household or outdoor tasks as efficiently as possible. To achieve this, a successful helper must: (1) infer the human's intents and constraints by following the human and observing their behaviors (social perception), and (2) make a cooperative plan tailored to the human partner to solve the task as quickly as possible, working together as a team (cooperative planning). To benchmark this challenge, we create four new agents with real physical constraints and eight long-horizon tasks featuring both indoor and outdoor scenes with various constraints, emergency events, and potential risks. We benchmark planning- and learning-based baselines on the challenge and introduce a new method that leverages large language models and behavior modeling. Empirical evaluations demonstrate the effectiveness of our benchmark in enabling systematic assessment of key aspects of machine social intelligence. Our benchmark and code are publicly available at this URL: this https URL.
https://arxiv.org/abs/2411.01796
Existing LLM agent systems typically select actions from a fixed and predefined set at every step. While this approach is effective in closed, narrowly-scoped environments, we argue that it presents two major challenges when deploying LLM agents in real-world scenarios: (1) selecting from a fixed set of actions significantly restricts the planning and acting capabilities of LLM agents, and (2) this approach requires substantial human effort to enumerate and implement all possible actions, which becomes impractical in complex environments with a vast number of potential actions. In this work, we propose an LLM agent framework that enables the dynamic creation and composition of actions in an online manner. In this framework, the agent interacts with the environment by generating and executing programs written in a general-purpose programming language at each step. Furthermore, generated actions are accumulated over time for future reuse. Our extensive experiments on the GAIA benchmark demonstrate that this framework offers significantly greater flexibility and outperforms previous methods. Notably, it allows an LLM agent to recover in scenarios where no relevant action exists in the predefined set or when existing actions fail due to unforeseen edge cases. At the time of writing, we hold the top position on the GAIA public leaderboard. Our code can be found at this https URL.
https://arxiv.org/abs/2411.01747
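The framework's core loop — the agent emits a program as its action, the program is executed, and successful actions accumulate in a library for reuse — could look roughly like the following. This is an assumed sketch, not the released code; the registry layout and calling convention are made up for illustration.

```python
# Sketch of dynamic action creation: generated programs are executed against a
# shared environment namespace and stored for reuse in later steps.
registry: dict[str, str] = {}  # accumulated action programs, name -> source

def execute_action(name: str, source: str, env: dict):
    """Run a generated program, call the action it defines, and keep it for reuse."""
    exec(source, env)           # define the action function inside the environment
    result = env[name](env)     # invoke it with the environment as its argument
    registry[name] = source     # accumulate the action for future steps
    return result

env = {"counter": 0}
src = (
    "def increment(env):\n"
    "    env['counter'] += 1\n"
    "    return env['counter']\n"
)
print(execute_action("increment", src, env))  # 1
print("increment" in registry)                # True
```

A real deployment would sandbox the execution; `exec` here only marks where the generated program runs.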
We study Multi-Robot Coverage Path Planning (MCPP) on a 4-neighbor 2D grid G, which aims to compute paths for multiple robots to cover all cells of G. Traditional approaches are limited as they first compute coverage trees on a quadrant coarsened grid H and then employ the Spanning Tree Coverage (STC) paradigm to generate paths on G, making them inapplicable to grids with partially obstructed 2x2 blocks. To address this limitation, we reformulate the problem directly on G, revolutionizing grid-based MCPP solving and establishing new NP-hardness results. We introduce Extended-STC (ESTC), a novel paradigm that extends STC to ensure complete coverage with bounded suboptimality, even when H includes partially obstructed blocks. Furthermore, we present LS-MCPP, a new algorithmic framework that integrates ESTC with three novel types of neighborhood operators within a local search strategy to optimize coverage paths directly on G. Unlike prior grid-based MCPP work, our approach also incorporates a versatile post-processing procedure that applies Multi-Agent Path Finding (MAPF) techniques to MCPP for the first time, enabling a fusion of these two important fields in multi-robot coordination. This procedure effectively resolves inter-robot conflicts and accommodates turning costs by solving a MAPF variant, making our MCPP solutions more practical for real-world applications. Extensive experiments demonstrate that our approach significantly improves solution quality and efficiency, managing up to 100 robots on grids as large as 256x256 within minutes of runtime. Validation with physical robots confirms the feasibility of our solutions under real-world conditions.
https://arxiv.org/abs/2411.01707
Autonomous Vehicle (AV) perception systems require more than simply seeing via, e.g., object detection or scene segmentation. They need a holistic understanding of what is happening within the scene for safe interaction with other road users. Few datasets exist for the purpose of developing and training algorithms to comprehend the actions of other road users. This paper presents ROAD-Waymo, an extensive dataset for the development and benchmarking of techniques for agent, action, location and event detection in road scenes, provided as a layer upon the (US) Waymo Open dataset. Considerably larger and more challenging than any existing dataset (and encompassing multiple cities), it comes with 198k annotated video frames, 54k agent tubes, 3.9M bounding boxes and a total of 12.4M labels. The integrity of the dataset has been confirmed and enhanced via a novel annotation pipeline designed for automatically identifying violations of requirements specifically designed for this dataset. As ROAD-Waymo is compatible with the original (UK) ROAD dataset, it provides the opportunity to tackle domain adaptation between real-world road scenarios in different countries within a novel benchmark: ROAD++.
https://arxiv.org/abs/2411.01683
Recent advancements have enabled Large Language Models (LLMs) to function as agents that can perform actions using external tools. This requires registering, i.e., integrating tool information into the LLM context prior to taking actions. Current methods indiscriminately incorporate all candidate tools into the agent's context and retain them across multiple reasoning steps. This process remains opaque to LLM agents and is not integrated into their reasoning procedures, leading to inefficiencies due to the increased context length from irrelevant tools. To address this, we introduce EcoAct, a tool-using algorithm that allows LLMs to selectively register tools as needed, optimizing context use. By integrating the tool registration process into the reasoning procedure, EcoAct reduces computational costs by over 50% on multi-step reasoning tasks while maintaining performance, as demonstrated through extensive experiments. Moreover, it can be plugged into any reasoning pipeline with only minor modifications to the prompt, making it applicable to LLM agents now and in the future.
https://arxiv.org/abs/2411.01643
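The context savings from selective registration can be made concrete with a toy cost model. Everything below — the tool set, the schemas, and the one-token-per-name accounting — is an illustrative assumption, not EcoAct's actual mechanism; it only shows why registering one tool on demand beats registering all tools up front.

```python
# Toy cost model for selective tool registration: names are always visible,
# but a tool's full schema enters the context only once it is registered.
TOOLS = {
    "search":     "search(query: str) -> list[str] # web search",
    "calculator": "calculator(expr: str) -> float # evaluate arithmetic",
    "weather":    "weather(city: str) -> str # current conditions",
}

def context_tokens(registered: set[str]) -> int:
    """1 'token' per tool name, plus the schema word count of registered tools."""
    cost = len(TOOLS)  # every tool name is always listed
    return cost + sum(len(TOOLS[t].split()) for t in registered)

eager = context_tokens(set(TOOLS))     # register everything up front
lazy = context_tokens({"calculator"})  # register only the tool actually used
print(eager, lazy, lazy < eager)
```

Under this toy accounting, the lazy context is less than half the eager one; EcoAct's reported >50% cost reduction comes from the same effect at LLM scale.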
In this study, we propose GITSR, an effective framework for Graph Interaction Transformer-based Scene Representation for multi-vehicle collaborative decision-making in intelligent transportation systems. In the context of mixed traffic where Connected Automated Vehicles (CAVs) and Human Driving Vehicles (HDVs) coexist, in order to enhance the understanding of the environment by CAVs to improve decision-making capabilities, this framework focuses on efficient scene representation and the modeling of spatial interaction behaviors of traffic states. We first extract features of the driving environment based on the background of intelligent networking. Subsequently, the local scene representation, which is based on the agent-centric and dynamic occupation grid, is calculated by the Transformer module. In addition, the feasible region of the map is captured through the multi-head attention mechanism to reduce vehicle collisions. Notably, spatial interaction behaviors, based on motion information, are modeled as graph structures and extracted via a Graph Neural Network (GNN). Ultimately, the collaborative decision-making among multiple vehicles is formulated as a Markov Decision Process (MDP), with driving actions output by Reinforcement Learning (RL) algorithms. Our algorithmic validation is executed within the extremely challenging scenario of the highway off-ramp task, thereby substantiating the superiority of the agent-centric approach to scene representation. Simulation results demonstrate that the GITSR method can not only effectively capture scene representation but also extract spatial interaction data, outperforming the baseline method across various comparative metrics.
https://arxiv.org/abs/2411.01608
Effective communication is an essential component in collaborative multi-agent systems. Situations where explicit messaging is not feasible have been common in human society throughout history, which motivates the study of implicit communication. Previous works on learning implicit communication mostly rely on theory of mind (ToM), where agents infer the mental states and intentions of others by interpreting their actions. However, ToM-based methods become less effective in making accurate inferences in complex tasks. In this work, we propose the Implicit Channel Protocol (ICP) framework, which allows agents to construct implicit communication channels similar to the explicit ones. ICP leverages a subset of actions, denoted as the scouting actions, and a mapping between information and these scouting actions that encodes and decodes the messages. We propose training algorithms for agents to message and act, including learning with a randomly initialized information map and with a delayed information map. The efficacy of ICP has been tested on the tasks of Guessing Number, Revealing Goals, and Hanabi, where ICP significantly outperforms baseline methods through more efficient information transmission.
https://arxiv.org/abs/2411.01553
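The ICP mechanism — a shared mapping between messages and a small subset of "scouting" actions, used as an encoder by the sender and inverted by the receiver — reduces to a codebook. The particular actions and messages below are assumptions for demonstration only.

```python
# Illustrative implicit channel: one agent encodes a message by choosing a
# scouting action; the teammate decodes by inverting the shared mapping.
SCOUTING_ACTIONS = ["move_left", "move_right", "wait", "turn"]
MESSAGES = ["goal_A", "goal_B", "goal_C", "goal_D"]

ENCODE = dict(zip(MESSAGES, SCOUTING_ACTIONS))  # sender's codebook
DECODE = {a: m for m, a in ENCODE.items()}      # receiver's inverse map

def send(message: str) -> str:
    """Act in a way the teammate can observe and interpret."""
    return ENCODE[message]

def receive(observed_action: str) -> str:
    """Invert the shared mapping to recover the intended message."""
    return DECODE[observed_action]

print(receive(send("goal_C")))  # goal_C
```

In the paper the mapping is learned (randomly initialized or delayed) rather than fixed; the fixed table above only shows the channel structure itself.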
Non-uniform goal selection has the potential to improve the reinforcement learning (RL) of skills over uniform-random selection. In this paper, we introduce a method for learning a goal-selection policy in intrinsically-motivated goal-conditioned RL: "Diversity Progress" (DP). The learner forms a curriculum based on observed improvement in discriminability over its set of goals. Our proposed method is applicable to the class of discriminability-motivated agents, where the intrinsic reward is computed as a function of the agent's certainty of following the true goal being pursued. This reward can motivate the agent to learn a set of diverse skills without extrinsic rewards. We demonstrate empirically that a DP-motivated agent can learn a set of distinguishable skills faster than previous approaches, and do so without suffering from a collapse of the goal distribution -- a known issue with some prior approaches. We end with plans to take this proof-of-concept forward.
https://arxiv.org/abs/2411.01521
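A "Diversity Progress" style goal selector can be sketched as sampling goals in proportion to the recent improvement in the discriminator's accuracy on each goal. The progress definition and the small smoothing constant below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of progress-proportional goal sampling: goals whose discriminability
# improved most recently are sampled most often.
import random

def dp_sample(prev_acc, curr_acc, rng, eps=1e-3):
    """Sample a goal index with probability proportional to discriminability gain."""
    progress = [max(c - p, 0.0) + eps for p, c in zip(prev_acc, curr_acc)]
    total = sum(progress)
    r = rng.random() * total
    for i, p in enumerate(progress):
        r -= p
        if r <= 0:
            return i
    return len(progress) - 1

rng = random.Random(0)
prev = [0.50, 0.50, 0.90]  # goal 2 is already well-discriminated
curr = [0.80, 0.55, 0.90]  # goal 0 is improving fastest
counts = [0, 0, 0]
for _ in range(1000):
    counts[dp_sample(prev, curr, rng)] += 1
print(counts)  # goal 0 sampled most often; stagnant goal 2 almost never
```

The `eps` floor keeps every goal sampleable, which is one simple way to avoid the goal-distribution collapse the abstract mentions.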
Traditional natural disaster response involves significant coordinated teamwork where speed and efficiency are key. Nonetheless, human limitations can delay critical actions and inadvertently increase human and economic losses. Agentic Large Vision Language Models (LVLMs) offer a new avenue to address this challenge, with the potential for substantial socio-economic impact, particularly by improving resilience and resource access in underdeveloped regions. We introduce DisasTeller, the first multi-LVLM-powered framework designed to automate tasks in post-disaster management, including on-site assessment, emergency alerts, resource allocation, and recovery planning. By coordinating four specialised LVLM agents with GPT-4 as the core model, DisasTeller autonomously implements disaster response activities, reducing human execution time and optimising resource distribution. Our evaluations through both LVLMs and humans demonstrate DisasTeller's effectiveness in streamlining disaster response. This framework not only supports expert teams but also simplifies access to disaster management processes for non-experts, bridging the gap between traditional response methods and LVLM-driven efficiency.
https://arxiv.org/abs/2411.01511
We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem in the frame of contextual dueling bandits. This formulation, subsuming recent paradigms such as online RLHF and online DPO, inherently quests for sample-efficient algorithms that incorporate online active exploration. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its applications in two distinct LLM alignment scenarios. The practical agent that efficiently implements this algorithm, named SEA (Sample-Efficient Alignment), is empirically validated through extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with oracle's preferences, outperforming recent active exploration methods for LLMs. Additionally, we release the implementation of SEA together with an efficient codebase designed for online alignment of LLMs, aiming to accelerate future research in this field.
https://arxiv.org/abs/2411.01493
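A Thompson-sampling loop for contextual dueling of candidate responses, in the spirit of the algorithm above, can be sketched with Beta-Bernoulli posteriors: sample each candidate's win rate, duel the two highest samples against the preference oracle, and update the winner and loser. This is a schematic sketch under toy assumptions, not the released SEA implementation.

```python
# Thompson sampling over pairwise preferences: each candidate keeps a Beta
# posterior over its win rate; the two highest posterior samples are dueled.
import random

def thompson_duel(candidates, oracle_prefers, rounds=2000, seed=0):
    rng = random.Random(seed)
    wins = {c: 1 for c in candidates}    # Beta(1, 1) priors
    losses = {c: 1 for c in candidates}
    for _ in range(rounds):
        sampled = {c: rng.betavariate(wins[c], losses[c]) for c in candidates}
        a, b = sorted(candidates, key=sampled.get, reverse=True)[:2]
        winner, loser = (a, b) if oracle_prefers(a, b) else (b, a)
        wins[winner] += 1
        losses[loser] += 1
    return max(candidates, key=lambda c: wins[c] / (wins[c] + losses[c]))

# toy deterministic preference oracle over made-up response "qualities"
quality = {"resp0": 0.2, "resp1": 0.5, "resp2": 0.9}
oracle = lambda a, b: quality[a] > quality[b]
best = thompson_duel(list(quality), oracle)
print(best)
```

Sampling from the posterior (rather than always dueling the empirical best) is what provides the active exploration the abstract emphasizes.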
Understanding and predicting human actions has been a long-standing challenge and is a crucial measure of perception in robotics AI. While significant progress has been made in anticipating the future actions of individual agents, prior work has largely overlooked a key aspect of real-world human activity -- interactions. To address this gap in human-like forecasting within multi-agent environments, we present the Hierarchical Memory-Aware Transformer (HiMemFormer), a transformer-based model for online multi-agent action anticipation. HiMemFormer integrates and distributes global memory that captures joint historical information across all agents through a transformer framework, with a hierarchical local memory decoder that interprets agent-specific features based on these global representations using a coarse-to-fine strategy. In contrast to previous approaches, HiMemFormer uniquely hierarchically applies the global context with agent-specific preferences to avoid noisy or redundant information in multi-agent action anticipation. Extensive experiments on various multi-agent scenarios demonstrate the significant performance of HiMemFormer, compared with other state-of-the-art methods.
https://arxiv.org/abs/2411.01455
We introduce a novel framework, Online Relational Inference (ORI), designed to efficiently identify hidden interaction graphs in evolving multi-agent interacting systems using streaming data. Unlike traditional offline methods that rely on a fixed training set, ORI employs online backpropagation, updating the model with each new data point, thereby allowing it to adapt to changing environments in real-time. A key innovation is the use of an adjacency matrix as a trainable parameter, optimized through a new adaptive learning rate technique called AdaRelation, which adjusts based on the historical sensitivity of the decoder to changes in the interaction graph. Additionally, a data augmentation method named Trajectory Mirror (TM) is introduced to improve generalization by exposing the model to varied trajectory patterns. Experimental results on both synthetic datasets and real-world data (CMU MoCap for human motion) demonstrate that ORI significantly improves the accuracy and adaptability of relational inference in dynamic settings compared to existing methods. This approach is model-agnostic, enabling seamless integration with various neural relational inference (NRI) architectures, and offers a robust solution for real-time applications in complex, evolving systems.
https://arxiv.org/abs/2411.01442
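ORI's central idea of treating the adjacency matrix itself as a trainable parameter updated online can be illustrated with a toy squared-error update per streaming observation. The update rule, the fixed learning rate standing in for AdaRelation's adaptive one, and the 2-agent example are all assumptions for demonstration.

```python
# Toy online relational inference: each edge weight of the adjacency matrix is
# nudged toward the influence observed in the latest data point.
def online_step(adj, observed_influence, lr):
    """One online gradient step on a squared-error loss over edge weights."""
    n = len(adj)
    for i in range(n):
        for j in range(n):
            grad = adj[i][j] - observed_influence[i][j]  # d/dw of 0.5*(w - obs)^2
            adj[i][j] -= lr * grad
    return adj

# two agents: agent 0 influences agent 1, not vice versa
true_graph = [[0.0, 1.0], [0.0, 0.0]]
adj = [[0.5, 0.5], [0.5, 0.5]]  # uninformative initialization
for _ in range(50):             # streaming updates, one observation at a time
    adj = online_step(adj, true_graph, lr=0.2)
print([[round(w, 2) for w in row] for row in adj])  # [[0.0, 1.0], [0.0, 0.0]]
```

Because every incoming point produces an update, the estimate would track a changing `true_graph` as well, which is the advantage over a fixed offline training set.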
Exploring unknown environments efficiently is a fundamental challenge in unsupervised goal-conditioned reinforcement learning. While selecting exploratory goals at the frontier of previously explored states is an effective strategy, the policy during training may still have limited capability of reaching rare goals on the frontier, resulting in reduced exploratory behavior. We propose "Cluster Edge Exploration" ($CE^2$), a new goal-directed exploration algorithm that, when choosing goals in sparsely explored areas of the state space, gives priority to goal states that remain accessible to the agent. The key idea is to cluster states that are easily reachable from one another under the current policy in a latent space, and to traverse to states with significant exploration potential on the boundary of these clusters before performing exploratory behavior. In challenging robotics environments including navigating a maze with a multi-legged ant robot, manipulating objects with a robot arm on a cluttered tabletop, and rotating objects in the palm of an anthropomorphic robotic hand, $CE^2$ demonstrates superior efficiency in exploration compared to baseline methods and ablations.
https://arxiv.org/abs/2411.01396
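The $CE^2$ goal-selection heuristic — group mutually reachable states into clusters, then pick a goal on the boundary of the cluster nearest the unexplored region — can be sketched on a 1D toy state space. The greedy clustering rule, the reachability radius, and the states below are illustrative assumptions; the paper works in a learned latent space.

```python
# Toy cluster-edge goal selection: cluster reachable states, then choose the
# boundary member of the cluster closest to the sparsely explored target.
def cluster(states, radius=1.5):
    """Greedy 1D clustering: a state within `radius` of a cluster's seed joins it."""
    clusters = []
    for s in sorted(states):
        if clusters and s - clusters[-1][0] <= radius:
            clusters[-1].append(s)
        else:
            clusters.append([s])
    return clusters

def pick_goal(states, frontier_target):
    """From the cluster nearest the target, choose the member on its boundary."""
    groups = cluster(states)
    best = min(groups, key=lambda g: min(abs(s - frontier_target) for s in g))
    return min(best, key=lambda s: abs(s - frontier_target))

visited = [0.0, 0.5, 1.0, 5.0, 5.2]  # two groups of mutually reachable states
goal = pick_goal(visited, frontier_target=6.0)
print(goal)  # 5.2 -- boundary of the cluster nearest the unexplored region
```

Choosing the boundary member of a reachable cluster (5.2), rather than the raw frontier point (6.0), is what keeps the exploratory goal accessible to the current policy.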