Non-traditional students in cybersecurity programs often lack access to advice from peers, family members, and professors, which can hinder their educational experience. Additionally, these students may not fully benefit from existing LLM-powered AI assistants due to issues such as content relevance, locality of advice, minimum expertise, and timing. This paper addresses these challenges by introducing an application designed to provide comprehensive support, answering questions related to knowledge, skills, and career-preparation advice tailored to the needs of these students. We developed a learning-tool platform, CyberMentor, to address the diverse needs and pain points of students majoring in cybersecurity. Powered by an agentic workflow and generative Large Language Models (LLMs), the platform leverages Retrieval-Augmented Generation (RAG) for accurate and contextually relevant information retrieval, achieving accessibility and personalization. We demonstrated its value in addressing knowledge requirements for cybersecurity education and career marketability, in tackling skill requirements for analytical and programming assignments, and in delivering real-time, on-demand learning support. Through three use scenarios, we showcased how CyberMentor facilitates knowledge acquisition and career preparation and provides seamless skill-based guidance and support. We also employed the LangChain prompt-based evaluation methodology to assess the platform's impact, confirming its strong performance in helpfulness, correctness, and completeness. These results underscore the system's ability to support students in developing practical cybersecurity skills while improving equity and sustainability within higher education. Furthermore, CyberMentor's open-source design allows for adaptation across other disciplines, fostering educational innovation and broadening its potential impact.
https://arxiv.org/abs/2501.09709
Values or principles are key elements of human society that influence people to behave according to an accepted standard set of social rules, maintaining social order. As AI systems become ubiquitous in human society, a major concern is that they could violate these norms or values and potentially cause harm. Thus, to prevent intentional or unintentional harm, AI systems are expected to take actions that align with these principles. Training systems to exhibit this type of behavior is difficult and often requires a specialized dataset. This work presents a multi-modal dataset illustrating normative and non-normative behavior in real-life situations described through natural language and artistic images. The training set contains curated images originally designed to teach young children about social principles; given this pedagogical origin, we argue it is an ideal dataset for training socially normative agents.
https://arxiv.org/abs/2501.09707
The rapid deployment of autonomous AI agents creates urgent challenges around authorization, accountability, and access control in digital spaces. New standards are needed to establish whom AI agents act on behalf of and to guide their use appropriately, protecting online spaces while unlocking the value of task delegation to autonomous agents. We introduce a novel framework for authenticated, authorized, and auditable delegation of authority to AI agents, in which human users can securely delegate and restrict the permissions and scope of agents while maintaining clear chains of accountability. The framework builds on existing identity and access management protocols, extending OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata while maintaining compatibility with established authentication and web infrastructure. Further, we propose a framework for translating flexible, natural-language permissions into auditable access control configurations, enabling robust scoping of AI agent capabilities across diverse interaction modalities. Taken together, this practical approach facilitates immediate deployment of AI agents while addressing key security and accountability concerns, working toward ensuring that agentic AI systems perform only appropriate actions and providing digital service providers with a tool to enable AI agent interactions without risking harm from scalable interaction.
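The core idea of agent-specific delegation can be illustrated with a minimal sketch: an OAuth-style token payload extended with hypothetical agent claims, where every authorization decision is checked against the delegated scope and appended to an audit trail. All field and scope names below are invented for illustration; they are not taken from the paper or from any published OAuth extension.

```python
from dataclasses import dataclass, field

# Hypothetical token payload: an OAuth-style access token extended with
# agent-specific claims in the spirit of the framework described above.
@dataclass
class AgentDelegationToken:
    subject: str                                      # human principal
    agent_id: str                                     # delegated AI agent
    scopes: set = field(default_factory=set)          # permitted actions
    audit_chain: list = field(default_factory=list)   # accountability trail

    def authorize(self, action: str) -> bool:
        """Allow an action only if it falls within the delegated scope,
        recording every decision for later audit."""
        allowed = action in self.scopes
        self.audit_chain.append((self.agent_id, action, allowed))
        return allowed

token = AgentDelegationToken(
    subject="user:alice",
    agent_id="agent:calendar-bot",
    scopes={"calendar.read", "calendar.write"},
)
print(token.authorize("calendar.write"))  # within delegated scope -> True
print(token.authorize("email.send"))      # outside delegation -> False
```

The audit chain gives the "clear chain of accountability" the abstract emphasizes: every attempted action is traceable to a specific agent acting for a specific human.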
https://arxiv.org/abs/2501.09674
In many real-world applications, agents must make sequential decisions in environments where conditions are subject to change due to various exogenous factors. These non-stationary environments pose significant challenges to traditional decision-making models, which typically assume stationary dynamics. Non-stationary Markov decision processes (NS-MDPs) offer a framework to model and solve decision problems under such changing conditions. However, the lack of standardized benchmarks and simulation tools has hindered systematic evaluation and advances in this field. We present NS-Gym, the first simulation toolkit designed explicitly for NS-MDPs, integrated within the popular Gymnasium framework. In NS-Gym, we segregate the evolution of the environmental parameters that characterize non-stationarity from the agent's decision-making module, allowing for modular and flexible adaptations to dynamic environments. We review prior work in this domain and present a toolkit encapsulating key problem characteristics and types in NS-MDPs. This toolkit is the first effort to develop a set of standardized interfaces and benchmark problems to enable consistent and reproducible evaluation of algorithms under non-stationary conditions. We also benchmark six algorithmic approaches from prior work on NS-MDPs using NS-Gym. Our vision is that NS-Gym will enable researchers to assess the adaptability and robustness of their decision-making algorithms to non-stationary conditions.
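The key design choice, separating the evolution of non-stationarity parameters from the agent-facing step logic, can be sketched with a toy wrapper. This is a minimal illustration of the pattern, not NS-Gym's actual API; the environment, schedule format, and class names are invented.

```python
import random

class WindyGridEnv:
    """Toy 1-D environment whose transition noise depends on a wind parameter."""
    def __init__(self, wind: float = 0.0):
        self.wind = wind
        self.position = 0

    def step(self, action: int) -> int:
        # wind occasionally reverses the chosen action
        if random.random() < self.wind:
            action = -action
        self.position += action
        return self.position

class NonStationaryWrapper:
    """Evolves environment parameters on an exogenous schedule, kept
    entirely separate from the agent's decision-making module."""
    def __init__(self, env, schedule):
        self.env = env
        self.schedule = schedule  # maps timestep -> new parameter value
        self.t = 0

    def step(self, action: int) -> int:
        if self.t in self.schedule:           # exogenous parameter change
            self.env.wind = self.schedule[self.t]
        self.t += 1
        return self.env.step(action)

env = NonStationaryWrapper(WindyGridEnv(), schedule={5: 0.5, 10: 0.9})
for _ in range(12):
    env.step(1)
print(env.env.wind)  # 0.9 after the scheduled shifts
```

Because the schedule lives in the wrapper, the same agent code can be evaluated against many non-stationarity profiles without modification, which is what makes consistent benchmarking possible.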
https://arxiv.org/abs/2501.09646
This paper addresses the challenges of translating case law under Hong Kong's bilingual legal system. It highlights the initial success of translating all written statutes into Chinese before the 1997 handover, a task mandated by the Basic Law. The effort involved significant collaboration among legal, linguistic, and translation experts, resulting in a comprehensive and culturally appropriate bilingual legal system. However, translating case law remains a significant challenge due to the sheer volume and continuous growth of judicial decisions. The paper critiques the government's and judiciary's sporadic and uncoordinated efforts to translate case law, contrasting them with the thorough approach previously taken for statute translation. Although the government acknowledges the importance of legal bilingualism, it lacks a sustainable strategy for translating case law. The Judiciary's position that translating all judgments is unnecessary, unrealistic, and not cost-effective is analyzed and critiqued for its impact on legal transparency and public trust. A proposed solution involves leveraging machine translation technology through a human-machine interactive translation platform, which undergoes two major transitions. Initially based on a neural model, the platform transitions to using a large language model for improved translation accuracy. Furthermore, it evolves from a single-agent system to a multi-agent system, incorporating Translator, Annotator, and Proofreader agents. This multi-agent approach, supported by a grant, aims to facilitate efficient, high-quality translation of judicial judgments by integrating advanced artificial intelligence and continuous feedback mechanisms, thus better meeting the needs of a bilingual legal system.
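The Translator, Annotator, and Proofreader roles form a pipeline whose control flow can be sketched as follows. The stub functions below stand in for LLM calls, and the case name and annotation logic are invented for illustration; only the hand-off structure reflects the paper's description.

```python
def translator(source: str) -> str:
    # stands in for an LLM producing a draft translation of a judgment
    return f"<draft translation of: {source}>"

def annotator(draft: str) -> list:
    # stands in for an agent flagging legal terms that need review;
    # a real annotator would consult bilingual legal glossaries
    return ["term-check"] if "translation" in draft else []

def proofreader(draft: str, notes: list) -> str:
    # stands in for an agent resolving the annotator's notes
    return draft + (" [reviewed]" if notes else "")

def translate_judgment(source: str) -> str:
    draft = translator(source)
    notes = annotator(draft)
    return proofreader(draft, notes)

result = translate_judgment("HKSAR v. Example")
print(result)
```

The value of the multi-agent split is that each stage produces an inspectable intermediate artifact (draft, annotations), which is where the continuous human feedback the paper describes can attach.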
https://arxiv.org/abs/2501.09444
Agent-based models (ABMs) are valuable for modelling complex, potentially out-of-equilibrium scenarios. However, ABMs have long suffered from the Lucas critique, which states that agent behaviour should adapt to environmental changes. Furthermore, the environment itself often adapts to these behavioural changes, creating a complex bi-level adaptation problem. Recent progress integrating multi-agent reinforcement learning into ABMs introduces adaptive agent behaviour, beginning to address the first part of this critique; however, these approaches remain relatively ad hoc, lacking a general formulation, and do not tackle the second aspect of simultaneously adapting environment-level characteristics in addition to agent behaviours. In this work, we develop a generic two-layer framework for ADaptive AGEnt based modelling (ADAGE) to address these problems. The framework formalises the bi-level problem as a Stackelberg game with conditional behavioural policies, providing a consolidated framework for adaptive agent-based modelling based on solving a coupled set of non-linear equations. We demonstrate how this generic approach encapsulates several common (previously viewed as distinct) ABM tasks, such as policy design, calibration, scenario generation, and robust behavioural learning, under one unified framework. We provide example simulations in multiple complex economic and financial environments, showing the strength of the novel framework in these canonical settings and addressing long-standing critiques of traditional ABMs.
https://arxiv.org/abs/2501.09429
Traditional in-person psychological counseling remains a niche service, typically sought by individuals with existing psychological issues, while online automated counseling offers a potential solution for those hesitant to seek help due to feelings of shame. Cognitive Behavioral Therapy (CBT) is an essential and widely used approach in psychological counseling. The advent of large language models (LLMs) and agent technology enables automatic CBT diagnosis and treatment. However, current LLM-based CBT systems either use agents with a fixed structure, limiting their self-optimization capabilities, or provide hollow, unhelpful suggestions due to redundant response patterns. In this work, we utilize Quora-like and YiXinLi single-round consultation models to build a general agent framework that generates high-quality responses for single-turn psychological consultation scenarios, using a bilingual dataset to evaluate the quality of the single-response consultations generated by each framework. We then incorporate dynamic routing and supervisory mechanisms inspired by real psychological counseling to construct AutoCBT, a CBT-oriented autonomous multi-agent framework, and demonstrate its general applicability. Experimental results indicate that AutoCBT can provide higher-quality automated psychological counseling services.
https://arxiv.org/abs/2501.09426
Multimodal AI agents are AI models capable of interactively and cooperatively assisting human users with day-to-day tasks. Augmented Reality (AR) head-worn devices can uniquely improve the user experience of solving procedural day-to-day tasks by providing egocentric multimodal (audio and video) observational capabilities to AI agents. Such AR capabilities help AI agents see and listen to the actions users take, mirroring the multimodal capabilities of human users. Existing AI agents, whether Large Language Models (LLMs) or Multimodal Vision-Language Models (VLMs), are reactive in nature: they cannot take an action without reading or listening to a human user's prompt. Proactivity, on the other hand, can help the human user detect and correct mistakes in agent-observed tasks, encourage users when they do tasks correctly, or simply engage in conversation with the user, akin to a human teaching or assisting a user. Our proposed YET to Intervene (YETI) multimodal agent focuses on the research question of identifying circumstances that may require the agent to intervene proactively. This allows the agent to understand when it can intervene in a conversation with human users to help them correct mistakes on tasks, such as cooking, using AR. Our YETI agent learns scene-understanding signals based on interpretable notions of Structural Similarity (SSIM) between consecutive video frames. We also define an alignment signal through which the AI agent can learn to identify whether the video frames corresponding to the user's actions on the task are consistent with the expected actions. These signals are used by our AI agent to determine when it should proactively intervene. We compare our results on the instances of proactive intervention in the HoloAssist multimodal benchmark for an expert agent guiding a user to complete procedural tasks.
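The SSIM signal the abstract mentions is a standard image-similarity measure. As a rough sketch of how a scene-change signal could be computed between consecutive frames, here is the standard SSIM formula applied globally over two tiny grayscale frames (flattened to pixel lists); real SSIM implementations, and presumably YETI, use local windows, and the example frames are invented.

```python
# Global SSIM over two flattened grayscale frames, using the standard
# constants c1 = (0.01 * L)^2 and c2 = (0.03 * L)^2 with dynamic range L=255.
def ssim(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

frame_a = [10, 20, 30, 40, 50, 60]
frame_b = [10, 20, 30, 40, 50, 60]     # unchanged scene
frame_c = [200, 10, 180, 5, 220, 15]   # large change between frames

print(ssim(frame_a, frame_b))          # identical frames -> 1.0
print(ssim(frame_a, frame_c) < 0.5)    # low similarity flags a scene change
```

A sharp drop in frame-to-frame SSIM is the kind of interpretable signal an agent could use as one input when deciding whether a user's action warrants proactive intervention.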
https://arxiv.org/abs/2501.09355
Effective chart summaries can significantly reduce the time and effort decision makers spend interpreting charts, enabling precise and efficient communication of data insights. Previous studies have faced challenges in generating accurate and semantically rich summaries of time-series data charts. In this paper, we identify summary elements and common hallucination types in the generation of time-series chart summaries, which serve as our guidelines for automatic generation. We introduce ChartInsighter, which automatically generates chart summaries of time-series data, effectively reducing hallucinations in chart summary generation. Specifically, we assign multiple agents to generate the initial chart summary and collaborate iteratively, during which they invoke external data analysis modules to extract insights and compile them into a coherent summary. Additionally, we implement a self-consistency test method to validate and correct our summary. We create a high-quality benchmark of charts and summaries, with hallucination types annotated on a sentence-by-sentence basis, facilitating the evaluation of the effectiveness of reducing hallucinations. Our evaluations using our benchmark show that our method surpasses state-of-the-art models and that our summary hallucination rate is the lowest, which effectively reduces various hallucinations and improves summary quality. The benchmark is available at this https URL.
https://arxiv.org/abs/2501.09349
In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at this https URL.
https://arxiv.org/abs/2501.09327
Despite significant advancements in general-purpose AI agents, several challenges still hinder their practical application in real-world scenarios. First, the limited planning capabilities of Large Language Models (LLMs) restrict AI agents from effectively solving complex tasks that require long-horizon planning. Second, general-purpose AI agents struggle to efficiently utilize domain-specific knowledge and human expertise. In this paper, we introduce the Standard Operational Procedure-guided Agent (SOP-agent), a novel framework for constructing domain-specific agents through pseudocode-style Standard Operational Procedures (SOPs) written in natural language. Formally, we represent an SOP as a decision graph, which is traversed to guide the agent in completing tasks specified by the SOP. We conduct extensive experiments across tasks in multiple domains, including decision-making, search and reasoning, code generation, data cleaning, and grounded customer service. The SOP-agent demonstrates excellent versatility, achieving performance superior to general-purpose agent frameworks and comparable to domain-specific agent systems. Additionally, we introduce the Grounded Customer Service Benchmark, the first benchmark designed to evaluate the grounded decision-making capabilities of AI agents in customer service scenarios based on SOPs.
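The "SOP as a decision graph" idea can be sketched minimally: nodes are named steps, each handler inspects the task context and returns the next node (or terminates). The customer-service scenario, node names, and handler logic below are invented for illustration; in the actual framework, nodes would carry natural-language pseudocode interpreted by an LLM.

```python
def check_order(ctx):
    # branch node: route based on what the agent observed
    return "refund" if ctx["order_found"] else "escalate"

def refund(ctx):
    ctx["log"].append("issued refund")
    return None                     # terminal node

def escalate(ctx):
    ctx["log"].append("escalated to human")
    return None                     # terminal node

# Decision graph: node name -> handler returning the next node (or None).
SOP = {"check_order": check_order, "refund": refund, "escalate": escalate}

def run_sop(graph, entry, ctx):
    """Traverse the decision graph from the entry node until a terminal node."""
    node = entry
    while node is not None:
        node = graph[node](ctx)
    return ctx["log"]

print(run_sop(SOP, "check_order", {"order_found": True, "log": []}))
```

Encoding the procedure as a graph rather than free-form instructions is what grounds the agent: it can only move along edges the SOP author defined, which is how domain expertise constrains long-horizon behavior.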
https://arxiv.org/abs/2501.09316
Building autonomous mobile robots (AMRs) with optimized efficiency and adaptive capabilities, able to respond to changing task demands and dynamic environments, is a strongly desired goal for advancing construction robotics. Such robots can play a critical role in enabling automation, reducing operational carbon footprints, and supporting modular construction processes. Inspired by the adaptive autonomy of living organisms, we introduce interoception, which centers on the robot's internal state representation, as a foundation for developing self-reflection and conscious learning to enable continual learning and adaptability in robotic agents. In this paper, we factorize internal state variables and mathematical properties as "cognitive dissonance" in shared control paradigms, where human interventions occasionally occur. We offer a new perspective on how interoception can help build adaptive motion planning in AMRs by integrating the legacy of heuristic costs from grid/graph-based algorithms with recent advances in neuroscience and reinforcement learning. Declarative and procedural knowledge extracted from human semantic inputs is encoded into a hypergraph model that overlaps with the spatial configuration of the onsite layout for path planning. In addition, we design a velocity-replay module using an encoder-decoder architecture with few-shot learning to enable robots to replicate velocity profiles in contextualized scenarios for multi-robot synchronization and handover collaboration. These "cached" knowledge representations are demonstrated in simulated environments for multi-robot motion planning and stacking tasks. The insights from this study pave the way toward artificial general intelligence in AMRs, fostering their progression from complexity to competence in construction automation.
https://arxiv.org/abs/2501.09290
Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations, evident not only in improved VQA accuracies but also in emerging safety-aware driving maneuvers. In addition, the learning demonstrates strong transferability from simulation to real-world observation. Code and data will be publicly available at this https URL .
https://arxiv.org/abs/2501.09167
Current visual SLAM systems face significant challenges in balancing computational efficiency with robust loop closure handling. Traditional approaches require careful manual tuning and incur substantial computational overhead, while learning-based methods either lack explicit loop closure capabilities or implement them through computationally expensive methods. We present AutoLoop, a novel approach that combines automated curriculum learning with efficient fine-tuning for visual SLAM systems. Our method employs a DDPG (Deep Deterministic Policy Gradient) agent to dynamically adjust loop closure weights during training, eliminating the need for manual hyperparameter search while significantly reducing the required training steps. The approach pre-computes potential loop closure pairs offline and leverages them through an agent-guided curriculum, allowing the model to adapt efficiently to new scenarios. Experiments trained on TartanAir and validated across multiple benchmarks, including KITTI, EuRoC, ICL-NUIM, and TUM RGB-D, demonstrate that AutoLoop achieves comparable or superior performance while reducing training time by an order of magnitude compared to traditional approaches. AutoLoop provides a practical solution for rapid adaptation of visual SLAM systems, automating the weight-tuning process that traditionally requires multiple manual iterations. Our results show that this automated curriculum strategy not only accelerates training but also maintains or improves the model's performance across diverse environmental conditions.
https://arxiv.org/abs/2501.09160
Generative artificial intelligence (GenAI) holds great promise as a tool to support personalized learning. Teachers need tools to efficiently and effectively enhance the readability of educational texts so that they match individual students' reading levels while retaining key details. Large Language Models (LLMs) show potential to fill this need, but previous research notes multiple shortcomings in current approaches. In this study, we introduce a generalized approach and metrics for systematically evaluating the accuracy and consistency with which LLMs, prompting techniques, and a novel multi-agent architecture simplify sixty informational reading passages, reducing each from the twelfth-grade level down to the eighth-, sixth-, and fourth-grade levels. We calculated the degree to which each LLM and prompting technique accurately achieved the targeted grade level for each passage, the percentage change in word count, and consistency in maintaining keywords and key phrases (semantic similarity). One-sample t-tests and multiple regression models revealed significant differences in the best-performing LLM and prompting technique for each of the four metrics. Both LLMs and prompting techniques demonstrated variable utility in grade-level accuracy and in consistency of keywords and key phrases when leveling content down to the fourth-grade reading level. These results demonstrate the promise of LLMs for efficient and precise automated text simplification, the shortcomings of current models and prompting methods in attaining an ideal balance across evaluation criteria, and a generalizable method for evaluating future systems.
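Two of the metrics described above, percentage change in word count and keyword retention, are straightforward to compute; a sketch follows. The example passage, the keyword set, and the use of substring matching as a stand-in for semantic similarity are all illustrative assumptions (the study used proper readability formulas and similarity measures).

```python
def word_count_change(original: str, simplified: str) -> float:
    """Percentage change in word count from original to simplified text."""
    a, b = len(original.split()), len(simplified.split())
    return 100.0 * (b - a) / a

def keyword_retention(keywords: set, simplified: str) -> float:
    """Fraction of key terms still present in the simplified text
    (a crude stand-in for semantic-similarity scoring)."""
    kept = {k for k in keywords if k.lower() in simplified.lower()}
    return len(kept) / len(keywords)

original = ("Photosynthesis converts solar radiation into chemical "
            "energy within chloroplasts.")
simplified = "Plants turn sunlight into energy in chloroplasts."

print(round(word_count_change(original, simplified), 1))        # -22.2
print(keyword_retention({"energy", "chloroplasts"}, simplified))  # 1.0
```

Grade-level accuracy would be computed the same way: apply a readability formula (e.g., Flesch-Kincaid) to the simplified text and compare the estimate against the targeted grade.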
https://arxiv.org/abs/2501.09158
Large Language Models (LLMs) have revolutionized artificial intelligence (AI) by enabling human-like text generation and natural language understanding. However, their reliance on static training data limits their ability to respond to dynamic, real-time queries, resulting in outdated or inaccurate outputs. Retrieval-Augmented Generation (RAG) has emerged as a solution, enhancing LLMs by integrating real-time data retrieval to provide contextually relevant and up-to-date responses. Despite its promise, traditional RAG systems are constrained by static workflows and lack the adaptability required for multi-step reasoning and complex task management. Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these limitations by embedding autonomous AI agents into the RAG pipeline. These agents leverage agentic design patterns (reflection, planning, tool use, and multi-agent collaboration) to dynamically manage retrieval strategies, iteratively refine contextual understanding, and adapt workflows to meet complex task requirements. This integration enables Agentic RAG systems to deliver unparalleled flexibility, scalability, and context awareness across diverse applications. This survey provides a comprehensive exploration of Agentic RAG, beginning with its foundational principles and the evolution of RAG paradigms. It presents a detailed taxonomy of Agentic RAG architectures, highlights key applications in industries such as healthcare, finance, and education, and examines practical implementation strategies. Additionally, it addresses the challenges of scaling these systems, ensuring ethical decision-making, and optimizing performance for real-world applications, while providing detailed insights into frameworks and tools for implementing Agentic RAG.
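The difference between static RAG and agentic RAG is a control-flow difference: instead of one fixed retrieve-then-generate pass, an agent loop plans a query, retrieves, reflects on whether the gathered context suffices, and retries with a rewritten query if not. The sketch below stubs retrieval and generation with toy functions; the corpus, the sufficiency test, and the query-rewriting rule are invented placeholders, and only the loop structure is the point.

```python
# Toy corpus standing in for a vector store.
CORPUS = {
    "rag": "RAG augments an LLM with retrieved documents.",
    "agents": "Agents add planning, reflection, and tool use.",
}

def retrieve(query: str) -> list:
    # stub retriever: keyword match instead of embedding search
    return [text for key, text in CORPUS.items() if key in query.lower()]

def sufficient(context: list) -> bool:
    # reflection step: decide whether more retrieval is needed
    return len(context) >= 2

def agentic_rag(query: str, max_rounds: int = 3) -> str:
    context, round_ = [], 0
    current_query = query
    while not sufficient(context) and round_ < max_rounds:
        context += retrieve(current_query)
        current_query = query + " agents"   # planner rewrites the query
        round_ += 1
    return f"answer from {len(context)} passages"

print(agentic_rag("what is rag?"))
```

A static RAG pipeline is the special case `max_rounds = 1` with no reflection; the agentic patterns named in the survey (planning, reflection, tool use) each replace one of the stubbed steps with an LLM-driven decision.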
https://arxiv.org/abs/2501.09136
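The reflection, planning, and tool-use loop that distinguishes Agentic RAG from static RAG can be sketched in a few lines. This is a hypothetical skeleton, not the survey's reference implementation: `retrieve` and `llm` are caller-supplied stand-ins for a retrieval tool and any text-in/text-out model.

```python
from typing import Callable, List

def agentic_rag(
    question: str,
    retrieve: Callable[[str], List[str]],  # retrieval tool, e.g. a vector store
    llm: Callable[[str], str],             # any text-in/text-out model
    max_rounds: int = 3,
) -> str:
    """Reflection loop: retrieve, judge sufficiency, refine the query, answer."""
    query = question
    context: List[str] = []
    for _ in range(max_rounds):
        context.extend(retrieve(query))  # tool use
        verdict = llm(                   # reflection on the gathered context
            f"Context: {context}\nQuestion: {question}\n"
            "Reply SUFFICIENT or propose a better search query."
        )
        if verdict.strip().upper().startswith("SUFFICIENT"):
            break
        query = verdict                  # planning: the critique becomes the next query
    return llm(f"Answer using only this context: {context}\nQuestion: {question}")
```

A static RAG pipeline is the `max_rounds=1` special case with the reflection step removed; the agentic variant spends extra model calls to adapt its retrieval strategy per question.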
The proliferation of misinformation on social media platforms has highlighted the need to understand how individual personality traits influence susceptibility to and propagation of misinformation. This study employs an innovative agent-based modeling approach to investigate the relationship between personality traits and misinformation dynamics. Using six AI agents embodying different dimensions of the Big Five personality traits (Extraversion, Agreeableness, and Neuroticism), we simulated interactions across six diverse misinformation topics. The experiment, implemented through the AgentScope framework using the GLM-4-Flash model, generated 90 unique interactions, revealing complex patterns in how personality combinations affect persuasion and resistance to misinformation. Our findings demonstrate that analytical and critical personality traits enhance effectiveness in evidence-based discussions, while non-aggressive persuasion strategies show unexpected success in misinformation correction. Notably, agents with critical traits achieved a 59.4% success rate in HIV-related misinformation discussions, while those employing non-aggressive approaches maintained consistent persuasion rates above 40% across different personality combinations. The study also revealed a non-transitive pattern in persuasion effectiveness, challenging conventional assumptions about personality-based influence. These results provide crucial insights for developing personality-aware interventions in digital environments and suggest that effective misinformation countermeasures should prioritize emotional connection and trust-building over confrontational approaches. The findings contribute to both theoretical understanding of personality-misinformation dynamics and practical strategies for combating misinformation in social media contexts.
https://arxiv.org/abs/2501.08985
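The 90 interactions follow directly from the experimental design: 15 unordered pairs of six personas, each debating six topics. A minimal harness for that grid is sketched below; the persona and topic labels are illustrative, and `interact` stands in for an AgentScope-driven dialogue.

```python
import itertools

PERSONAS = ["extraversion_high", "extraversion_low",
            "agreeableness_high", "agreeableness_low",
            "neuroticism_high", "neuroticism_low"]
TOPICS = ["hiv", "vaccines", "climate", "5g", "elections", "nutrition"]

def run_simulation(interact):
    """Run every unordered persona pair on every topic: C(6, 2) * 6 = 90 runs.

    `interact(a, b, topic)` should run one dialogue and return its outcome,
    e.g. whether the misinformation correction was accepted.
    """
    return {
        (a, b, topic): interact(a, b, topic)
        for topic in TOPICS
        for a, b in itertools.combinations(PERSONAS, 2)
    }

def success_rate(results, topic):
    # Fraction of interactions on `topic` with a truthy (persuaded) outcome.
    on_topic = [v for (a, b, t), v in results.items() if t == topic]
    return sum(bool(v) for v in on_topic) / len(on_topic)
```

Per-topic rates such as the reported 59.4% on HIV-related discussions would come out of `success_rate` once `interact` is backed by real model dialogues.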
Urban air mobility (UAM) is a transformative system that operates various small aerial vehicles in urban environments to reshape urban transportation. However, integrating UAM into existing urban environments presents a variety of complex challenges. Recent analyses of UAM's operational constraints highlight aircraft noise and system safety as key hurdles to UAM system implementation. Future UAM air traffic management schemes must ensure that the system is both quiet and safe. We propose a multi-agent reinforcement learning approach to manage UAM traffic, aiming at both vertical separation assurance and noise mitigation. Through extensive training, the reinforcement learning agent learns to balance the two primary objectives by employing altitude adjustments in a multi-layer UAM network. The results reveal the tradeoffs among noise impact, traffic congestion, and separation. Overall, our findings demonstrate the potential of reinforcement learning in mitigating UAM's noise impact while maintaining safe separation using altitude adjustments.
https://arxiv.org/abs/2501.08941
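The two competing objectives can be folded into a single shaped reward. The sketch below is a hypothetical reward function, not the paper's: separation violations are penalized linearly, ground-noise exposure is proxied by inverse altitude, and `noise_weight` trades the two off.

```python
def uam_reward(altitudes, min_sep=150.0, noise_weight=0.5):
    """Combined reward for one time step of a multi-layer UAM network.

    `altitudes` holds each aircraft's altitude in meters. Lower flight is
    noisier on the ground (inverse-altitude proxy); any pair closer than
    `min_sep` vertically incurs a separation penalty.
    """
    sep_penalty = 0.0
    for i in range(len(altitudes)):
        for j in range(i + 1, len(altitudes)):
            gap = abs(altitudes[i] - altitudes[j])
            if gap < min_sep:
                sep_penalty += (min_sep - gap) / min_sep
    noise_penalty = sum(1000.0 / max(a, 1.0) for a in altitudes) / len(altitudes)
    return -(sep_penalty + noise_weight * noise_penalty)
```

An agent maximizing this reward raises aircraft to cut noise only until the vertical layers become crowded, which reproduces the noise, congestion, and separation trade-off the abstract reports.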
Exploration is a crucial skill for self-improvement and open-ended problem-solving. However, it remains uncertain whether large language models can effectively explore the state-space. Existing evaluations predominantly focus on the trade-off between exploration and exploitation, often assessed in multi-armed bandit problems. In contrast, this work isolates exploration as the sole objective, tasking the agent with delivering information that enhances future returns. For the evaluation, we propose to decompose missing rewards into exploration and exploitation components by measuring the optimal achievable return for the states already explored. Our experiments with various LLMs reveal that most models struggle to sufficiently explore the state-space and that weak exploration is insufficient. We observe a positive correlation between model size and exploration performance, with larger models demonstrating superior capabilities. Furthermore, we show that our decomposition provides insights into differences in behaviors driven by agent instructions during prompt engineering, offering a valuable tool for refining LLM performance in exploratory tasks.
https://arxiv.org/abs/2501.08925
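The proposed decomposition can be stated in three lines. Under the abstract's framing, the optimal return achievable over already-explored states bounds what pure exploitation could have recovered; the remainder of the gap to the global optimum is attributable to missing exploration. Argument names below are illustrative.

```python
def decompose_missing_reward(achieved, explored_values, optimal):
    """Split the missing reward (optimal - achieved) into two gaps.

    `explored_values` are the optimal achievable returns for the states the
    agent actually visited; `optimal` is the best return over all states.
    """
    best_explored = max(explored_values)
    exploitation_gap = best_explored - achieved  # found a good state, under-used it
    exploration_gap = optimal - best_explored    # never found the best states
    return exploration_gap, exploitation_gap
```

By construction the two gaps sum to the total missing reward, so a model with a large `exploration_gap` is failing at exploration itself rather than at exploiting what it has already seen.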
Strategic interactions can be represented more concisely, and analyzed and solved more efficiently, if we are aware of the symmetries within the multiagent system. Symmetries also have conceptual implications, for example for equilibrium selection. We study the computational complexity of identifying and using symmetries. Using the classical framework of normal-form games, we consider game symmetries that can be across some or all players and/or actions. We find a strong connection between game symmetries and graph automorphisms, yielding graph automorphism and graph isomorphism completeness results for characterizing the symmetries present in a game. On the other hand, we also show that the problem becomes polynomial-time solvable when we restrict the consideration of actions in one of two ways. Next, we investigate when exactly game symmetries can be successfully leveraged for Nash equilibrium computation. We show that finding a Nash equilibrium that respects a given set of symmetries is PPAD- and CLS-complete in general-sum and team games respectively -- that is, exactly as hard as Brouwer fixed point and gradient descent problems. Finally, we present polynomial-time methods for the special cases where we are aware of a vast number of symmetries, or where the game is two-player zero-sum and we do not even know the symmetries.
https://arxiv.org/abs/2501.08905
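The graph-automorphism hardness results concern *finding* symmetries; merely *verifying* a candidate symmetry is straightforward. A sketch for two-player normal-form games, checking action symmetries with the players held fixed (player-swapping symmetries would additionally exchange the payoff components):

```python
def is_action_symmetry(payoffs, row_perm, col_perm):
    """Check that relabeling row actions by `row_perm` and column actions by
    `col_perm` leaves the payoff table invariant.

    `payoffs[i][j]` is the (row player, column player) payoff pair for the
    action profile (i, j); the permutations are given as index lists.
    """
    n, m = len(payoffs), len(payoffs[0])
    return all(
        payoffs[i][j] == payoffs[row_perm[i]][col_perm[j]]
        for i in range(n)
        for j in range(m)
    )
```

In a 2x2 coordination game, jointly swapping both players' actions is a symmetry while swapping only one side's is not, and a strategy profile respecting the joint swap must mix both actions equally.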