Effective human-AI collaboration hinges not only on the AI agent's ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks -- common ground, relevance theory, and theory of mind -- into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms to interpret unclear instructions, namely those that are ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT-4-powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human-AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent's pragmatic reasoning, fostering effective human-AI collaboration and enabling context-aware communication in LLM-based agents.
https://arxiv.org/abs/2503.14484
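As a rough illustration of how such a normative layer might sit in front of an LLM agent, the sketch below triages an instruction against the four maxims before the agent plans. The rubric wording, function names, and `call_llm` are placeholders, not the paper's code:

```python
# Illustrative only: triage an instruction against the Gricean norms before
# the agent acts in the Doors, Keys, and Gems grid world. `call_llm` stands
# in for any chat-completion client.

GRICEAN_RUBRIC = """Classify the user's instruction to a grid-world agent as
CLEAR, AMBIGUOUS (violates Manner: admits multiple readings),
INCOMPLETE (violates Quantity: omits needed information),
INVALID (violates Quality: impossible in the current state), or
IRRELEVANT (violates Relation: unrelated to the shared goal).
If unclear, infer the most cooperative reading from the world state,
or ask one clarifying question.
World state: {state}
Instruction: {instruction}"""

def triage_instruction(call_llm, state: str, instruction: str) -> str:
    """Return the label plus a cooperative interpretation or question."""
    return call_llm(GRICEAN_RUBRIC.format(state=state, instruction=instruction))
```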
To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such capabilities are hard to teach through supervised fine-tuning because they require constructing examples that reflect the agent's specific capabilities. We therefore propose a radically new approach to teaching agents what they know: \emph{collaborative self-play}. We construct multi-agent collaborations in which the group is rewarded for collectively arriving at correct answers. The desired meta-knowledge emerges from the incentives built into the structure of the interaction. We focus on small societies of agents that have access to heterogeneous tools (corpus-specific retrieval), and therefore must collaborate to maximize their success while minimizing their effort. Experiments show that group-level rewards for multi-agent communities can induce policies that \emph{transfer} to improve tool use and selective prediction in settings where individual agents are deployed in isolation.
https://arxiv.org/abs/2503.14481
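A minimal sketch of the group-level incentive described above, assuming a shared correctness reward discounted by a per-tool-call effort cost (the dataclass, names, and cost constant are illustrative, not the authors' implementation):

```python
from dataclasses import dataclass

@dataclass
class AgentTurn:
    answer: str | None   # None if the agent abstained this episode
    tool_calls: int      # retrieval calls the agent issued

def group_reward(turns: list[AgentTurn], final_answer: str,
                 gold: str, effort_cost: float = 0.05) -> float:
    """Shared reward: collective correctness minus total retrieval effort."""
    correct = float(final_answer.strip() == gold.strip())
    effort = effort_cost * sum(t.tool_calls for t in turns)
    return correct - effort   # every agent in the society receives this signal
```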
Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically "plays" with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.
https://arxiv.org/abs/2503.14432
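The trial-and-error loop might look roughly like the following sketch, where `call_llm` and `tool` are placeholder interfaces and the prompts are paraphrases rather than the authors' actual ones:

```python
def play_with_tool(call_llm, tool, doc: str, rounds: int = 10):
    """Probe a tool's input-output behavior and refine its documentation."""
    examples = []
    for _ in range(rounds):
        args = call_llm(f"Tool documentation:\n{doc}\n"
                        "Propose one JSON argument object to try the tool.")
        try:
            result = tool(args)            # observe real input-output behavior
            examples.append((args, result))
        except Exception as err:           # failures also inform the rewrite
            result = f"ERROR: {err}"
        doc = call_llm(f"Rewrite the documentation given this trial.\n"
                       f"doc: {doc}\nargs: {args}\nresult: {result}")
    return doc, examples                   # refined doc + label-free usage examples
```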
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions. These gestures significantly influence the addressee's engagement, recall, comprehension, and attitudes toward the speaker. Similarly, they impact interactions between humans and embodied virtual agents. The process of selecting and animating meaningful gestures has thus become a key focus in the design of these agents. However, automating this gesture selection process poses a significant challenge. Prior gesture generation techniques have ranged from fully automated, data-driven methods, which often struggle to produce contextually meaningful gestures, to more manual approaches that require specific gesture-crafting expertise, are time-consuming, and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures. We first describe how information on gestures is encoded into GPT-4. Then, we conduct a study to evaluate alternative prompting approaches for their ability to select meaningful, contextually relevant gestures and to align them appropriately with the co-speech utterance. Finally, we detail and demonstrate how this approach has been implemented within a virtual agent system, automating the selection and subsequent animation of the selected gestures for enhanced human-agent interactions.
https://arxiv.org/abs/2503.14408
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development, by enabling automation. In software engineering, LLM-powered coding agents have garnered significant attention due to their potential to automate complex development tasks, assist in debugging, and enhance productivity. However, existing approaches often struggle with sub-optimal decision-making, requiring either extensive manual intervention or inefficient compute scaling strategies. To improve coding agent performance, we present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents that recovers from sub-optimal decisions faster and more effectively than baselines. While traditional agents either follow linear trajectories or rely on random sampling for scaling compute, DARS branches a trajectory at certain key decision points by taking an alternative action given the history of the trajectory and the execution feedback of the previous attempt from that point. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2. Our framework achieves a pass@1 rate of 47%, outperforming state-of-the-art (SOTA) open-source frameworks.
https://arxiv.org/abs/2503.14269
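A schematic of the branching step, under assumed interfaces (`policy.sample`, `env.rollout`, and the scoring attribute are illustrative stand-ins, not the DARS API):

```python
def dars_branch(policy, env, trajectory, feedback, k: int = 3):
    """Branch k alternatives from the last decision point of a failed run."""
    branches = []
    for _ in range(k):
        alt_action = policy.sample(
            history=trajectory[:-1],   # the trajectory up to the decision point
            avoid=trajectory[-1],      # the previously taken, sub-optimal action
            feedback=feedback,         # e.g., failing test output from that attempt
        )
        branches.append(env.rollout(trajectory[:-1] + [alt_action]))
    return max(branches, key=lambda b: b.score)   # keep the most promising branch
```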
Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone, overlooking the complexities of human-populated, dynamic environments. We introduce a unified Human-Aware VLN (HA-VLN) benchmark that merges these paradigms under explicit social-awareness constraints. Our contributions include: 1. A standardized task definition that balances discrete-continuous navigation with personal-space requirements; 2. An enhanced human motion dataset (HAPS 2.0) and upgraded simulators capturing realistic multi-human interactions, outdoor contexts, and refined motion-language alignment; 3. Extensive benchmarking on 16,844 human-centric instructions, revealing how multi-human dynamics and partial observability pose substantial challenges for leading VLN agents; 4. Real-world robot tests validating sim-to-real transfer in crowded indoor spaces; and 5. A public leaderboard supporting transparent comparisons across discrete and continuous tasks. Empirical results show improved navigation success and fewer collisions when social context is integrated, underscoring the need for human-centric design. By releasing all datasets, simulators, agent code, and evaluation tools, we aim to advance safer, more capable, and socially responsible VLN research.
https://arxiv.org/abs/2503.14229
The aspiration of the Vision-and-Language Navigation (VLN) task has long been to develop an embodied agent with robust adaptability, capable of seamlessly transferring its navigation capabilities across various tasks. Despite remarkable advancements in recent years, most methods necessitate dataset-specific training, thereby lacking the capability to generalize across diverse datasets encompassing distinct types of instructions. Large language models (LLMs) have demonstrated exceptional reasoning and generalization abilities, exhibiting immense potential in robot action planning. In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner, enabling effective generalization across diverse VLN datasets. Moreover, a verification mechanism and a multi-model integration mechanism are proposed to mitigate potential hallucinations by the LLM Planner and enhance the execution accuracy of the Instruction Follower. We take REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability. The generalization performance of FlexVLN surpasses that of all previous methods by a large margin.
https://arxiv.org/abs/2503.13966
Collaborative perception in multi-agent systems enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which lacks flexibility in extracting and transmitting informative features and can hardly focus on them during fusion. To address these problems, this paper proposes a novel Anchor-Centric paradigm for Collaborative Object detection (ACCO). It avoids grid precision issues and allows more flexible and efficient anchor-centric communication and fusion. ACCO is composed of three main components: (1) an Anchor featuring block (AFB) that generates anchor proposals and projects prepared anchor queries onto image features; (2) an Anchor confidence generator (ACG) designed to minimize communication by selecting only the features of confident anchors for transmission; and (3) a local-global fusion module, in which local fusion is anchor-alignment-based fusion (LAAF) and global fusion is conducted by spatial-aware cross-attention (SACA). LAAF and SACA run over multiple layers, so agents conduct anchor-centric fusion iteratively to adjust the anchor proposals. Comprehensive experiments evaluate ACCO on the OPV2V and DAIR-V2X datasets, demonstrating ACCO's superiority in reducing communication volume and in improving perception range and detection performance. Code can be found at: \href{this https URL}{this https URL}.
https://arxiv.org/abs/2503.13946
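The communication-reduction idea behind the ACG can be sketched as confidence-thresholded anchor selection; the shapes and threshold below are illustrative assumptions:

```python
import torch

def select_confident_anchors(anchor_feats: torch.Tensor,
                             confidences: torch.Tensor,
                             tau: float = 0.5) -> dict:
    """anchor_feats: (N, C); confidences: (N,). Keep only confident anchors."""
    keep = confidences > tau
    idx = keep.nonzero(as_tuple=True)[0]
    # Only these indices and features are broadcast to collaborating agents,
    # instead of a dense feature map.
    return {"indices": idx, "features": anchor_feats[idx]}
```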
While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making processes, current Retrieval-Augmented Generation (RAG) systems typically operate through single-source knowledge retrieval, leading to a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture-of-knowledge-paths enhanced retrieval mechanism through functional partitioning of a large language model (LLM) corpus into distinct sections, enabling retrieval from multiple specialized knowledge paths. Applied to the generation of 3D simulated environments, our proposed MoK-RAG3D enhances this paradigm by partitioning 3D assets into distinct sections and organizing them based on a hierarchical knowledge tree structure. Different from previous methods that only use manual evaluation, we pioneer the introduction of automated evaluation methods for 3D scenes. Both automatic and human evaluations in our experiments demonstrate that MoK-RAG3D can assist Embodied AI agents in generating diverse scenes.
https://arxiv.org/abs/2503.13882
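A minimal sketch of multi-path retrieval over a functionally partitioned corpus (the `retriever.search` interface and section labels are assumptions for illustration):

```python
def multi_path_retrieve(query: str, sections: dict, k: int = 3) -> list[str]:
    """sections maps a section name to a retriever exposing .search(query, k)."""
    hits = []
    for name, retriever in sections.items():   # one specialized knowledge path each
        for passage in retriever.search(query, k):
            hits.append(f"[{name}] {passage}")
    return hits   # concatenated into the generator's prompt downstream
```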
Large Language Models (LLMs) have made significant progress in various fields, but challenges remain in Multi-Disciplinary Team (MDT) medical consultations. Current research enhances reasoning through role assignment, task decomposition, and accumulation of medical experience, yet multi-role collaboration in MDT consultations often results in excessively long dialogue histories, which increases the model's cognitive burden and degrades both efficiency and accuracy. Some methods only store treatment histories without extracting effective experience or reflecting on errors, limiting knowledge generalization and system evolution. We propose a multi-agent MDT medical consultation framework based on LLMs to address these issues. Our framework uses consensus aggregation and a residual discussion structure for multi-round consultations, and employs a Correct Answer Knowledge Base (CorrectKB) and a Chain-of-Thought Knowledge Base (ChainKB) to accumulate consultation experience. These mechanisms enable the framework to evolve and continually improve diagnostic rationality and accuracy. Experimental results on the MedQA and PubMedQA datasets demonstrate that our framework achieves accuracies of 90.1% and 83.9%, respectively, and that the constructed knowledge bases generalize effectively across test sets from both datasets.
https://arxiv.org/abs/2503.13856
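Consensus aggregation with a residual discussion round might be organized as below; the agent and knowledge-base interfaces are assumptions, not the paper's code:

```python
from collections import Counter

def consult(agents, case: str, knowledge_base, rounds: int = 2) -> str:
    """Independent answers, then residual discussion limited to dissenters."""
    answers = {agent.name: agent.answer(case) for agent in agents}
    for _ in range(rounds):
        majority, count = Counter(answers.values()).most_common(1)[0]
        if count == len(answers):                # full consensus reached
            break
        evidence = knowledge_base.lookup(case)   # e.g., CorrectKB / ChainKB entries
        for agent in agents:
            if answers[agent.name] != majority:  # only dissenters re-discuss
                answers[agent.name] = agent.revise(case, majority, evidence)
    return Counter(answers.values()).most_common(1)[0][0]
```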
The increasing reliance on web interfaces presents many challenges for visually impaired users, underscoring the need for more advanced assistive technologies. This paper introduces WebNav, a voice-controlled web navigation agent that leverages a ReAct-inspired architecture and generative AI to address this need. WebNav comprises a hierarchical structure: a Digital Navigation Module (DIGNAV) for high-level strategic planning, an Assistant Module for translating abstract commands into executable actions, and an Inference Module for low-level interaction. A key component is a dynamic labeling engine, implemented as a browser extension, that generates real-time labels for interactive elements, creating a mapping between voice commands and Document Object Model (DOM) components. Preliminary evaluations show that WebNav outperforms traditional screen readers in response time and task completion accuracy for visually impaired users. Future work will focus on extensive user evaluations, benchmark development, and refining the agent's adaptive capabilities for real-world deployment.
https://arxiv.org/abs/2503.13843
Reinforcement learning control algorithms face significant challenges due to out-of-distribution data and inefficient exploration. While model-based reinforcement learning enhances the agent's reasoning and planning capabilities by constructing virtual environments, training such virtual environments can be very complex. In order to build an efficient inference model and enhance the representativeness of the learning data, we propose the Counterfactual Experience Augmentation (CEA) algorithm. CEA leverages variational autoencoders to model the dynamic patterns of state transitions and introduces randomness to model non-stationarity. This approach focuses on expanding the learning data in the experience pool through counterfactual inference and performs exceptionally well in environments that follow the bisimulation assumption. Since environments with bisimulation properties are usually represented by discrete observation and action spaces, we propose a sampling method based on maximum kernel density estimation entropy to extend CEA to various environments. By providing reward signals for counterfactual state transitions based on real information, CEA constructs complete counterfactual experiences to alleviate the out-of-distribution problem in the learning data, and it outperforms general SOTA algorithms in environments with different properties. Finally, we discuss the similarities, differences, and properties of the generated counterfactual experiences and real experiences. The code is available at this https URL.
https://arxiv.org/abs/2503.13842
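A sketch of the augmentation step under assumed interfaces (the VAE encode/decode API, buffer layout, and `reward_fn` are placeholders, not the released code): counterfactual next states are proposed for untaken actions and rewarded from real information before joining the experience pool:

```python
def augment_buffer(buffer, vae, reward_fn, actions, n: int = 1000):
    """Add counterfactual transitions for untaken actions to the pool."""
    for state, taken_action, _, _ in buffer.sample(n):
        for action in actions:
            if action == taken_action:
                continue
            latent = vae.encode(state, action).sample()  # stochastic, models non-stationarity
            cf_next = vae.decode(latent)                 # counterfactual next state
            reward = reward_fn(state, action, cf_next)   # reward from real information
            buffer.add(state, action, reward, cf_next)
```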
Designing reward functions for continuous-control robotics often leads to subtle misalignments or reward hacking, especially in complex tasks. Preference-based RL mitigates some of these pitfalls by learning rewards from comparative feedback rather than hand-crafted signals, yet scaling human annotations remains challenging. Recent work uses Vision-Language Models (VLMs) to automate preference labeling, but a single final-state image generally fails to capture the agent's full motion. In this paper, we present a two-part solution that both improves feedback accuracy and better aligns reward learning with the agent's policy. First, we overlay trajectory sketches on final observations to reveal the path taken, allowing VLMs to provide more reliable preferences and improving preference accuracy by approximately 15-20% on Meta-World tasks. Second, we regularize reward learning by incorporating the agent's performance, ensuring that the reward model is optimized on data generated by the current policy; this addition boosts episode returns by 20-30% in locomotion tasks. Empirical studies on Meta-World demonstrate that our method achieves, for instance, around a 70-80% success rate across all tasks, compared to below 50% for standard approaches. These results underscore the efficacy of combining richer visual representations with agent-aware reward regularization.
https://arxiv.org/abs/2503.13817
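The first component can be sketched directly: draw the path on the final frame so the preference-labeling VLM sees the full motion. This uses PIL; the colors and line width are arbitrary choices, not the paper's settings:

```python
from PIL import Image, ImageDraw

def overlay_trajectory(final_frame: Image.Image,
                       xy_path: list[tuple[float, float]]) -> Image.Image:
    """Draw the agent's (x, y) path onto the final observation."""
    img = final_frame.copy()
    draw = ImageDraw.Draw(img)
    draw.line(xy_path, fill=(255, 0, 0), width=3)                 # the path taken
    x, y = xy_path[-1]
    draw.ellipse([x - 4, y - 4, x + 4, y + 4], fill=(0, 255, 0))  # endpoint marker
    return img   # shown to the VLM alongside the competing trajectory's frame
```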
The rapid evolution of artificial intelligence (AI) has ushered in a new era of integrated systems that merge computational prowess with human decision-making. In this paper, we introduce the concept of \textbf{Orchestrated Distributed Intelligence (ODI)}, a novel paradigm that reconceptualizes AI not as isolated autonomous agents, but as cohesive, orchestrated networks that work in tandem with human expertise. ODI leverages advanced orchestration layers, multi-loop feedback mechanisms, and a high cognitive density framework to transform static, record-keeping systems into dynamic, action-oriented environments. Through a comprehensive review of multi-agent system literature, recent technological advances, and practical insights from industry forums, we argue that the future of AI lies in integrating distributed intelligence within human-centric workflows. This approach not only enhances operational efficiency and strategic agility but also addresses challenges related to scalability, transparency, and ethical decision-making. Our work outlines key theoretical implications and presents a practical roadmap for future research and enterprise innovation, aiming to pave the way for responsible and adaptive AI systems that drive sustainable innovation in human organizations.
https://arxiv.org/abs/2503.13754
Despite growing enthusiasm for Multi-Agent Systems (MAS), where multiple LLM agents collaborate to accomplish tasks, their performance gains across popular benchmarks remain minimal compared to single-agent frameworks. This gap highlights the need to analyze the challenges hindering MAS effectiveness. In this paper, we present the first comprehensive study of MAS challenges. We analyze five popular MAS frameworks across over 150 tasks, involving six expert human annotators. We identify 14 unique failure modes and propose a comprehensive taxonomy applicable to various MAS frameworks, the Multi-Agent System Failure Taxonomy (MASFT). This taxonomy emerges iteratively from agreements among three expert annotators per study, achieving a Cohen's Kappa score of 0.88. These fine-grained failure modes are organized into three categories: (i) specification and system design failures, (ii) inter-agent misalignment, and (iii) task verification and termination. To support scalable evaluation, we integrate MASFT with LLM-as-a-Judge. We also explore whether the identified failures could be easily prevented by proposing two interventions: improved specification of agent roles and enhanced orchestration strategies. Our findings reveal that the identified failures require more complex solutions, highlighting a clear roadmap for future research. We open-source our dataset and LLM annotator.
https://arxiv.org/abs/2503.13657
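Scalable evaluation with an LLM judge could be wired roughly as follows; the prompt wording paraphrases the taxonomy's three coarse categories and is not the released annotator:

```python
JUDGE_PROMPT = """You are auditing a multi-agent LLM trace for failures.
Coarse categories: (i) specification and system design,
(ii) inter-agent misalignment, (iii) task verification and termination.
Trace:
{trace}
Return a JSON list of objects with keys "category", "mode", "evidence"."""

def judge_trace(call_llm, trace: str) -> str:
    """Ask a judge model to tag failure modes observed in a trace."""
    return call_llm(JUDGE_PROMPT.format(trace=trace))
```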
Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning, especially for videos, remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks demonstrate that our agent achieves state-of-the-art performance on diverse video understanding tasks, including 3 on grounded video question-answering, 6 on video temporal grounding, and 5 on general video question-answering, underscoring its effectiveness in advancing video agent and long-form temporal reasoning.
https://arxiv.org/abs/2503.13444
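A sketch of adaptor-based role switching with the `peft` library; model identifiers and adapter paths are placeholders, and the paper's actual training and switching logic may differ:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("base-model")   # placeholder model id
tok = AutoTokenizer.from_pretrained("base-model")
model = PeftModel.from_pretrained(base, "adapters/planner", adapter_name="planner")
for role in ("grounder", "verifier", "answerer"):
    model.load_adapter(f"adapters/{role}", adapter_name=role)  # placeholder paths

def run_role(role: str, prompt: str):
    model.set_adapter(role)   # switch roles on one backbone, no extra model copies
    return model.generate(**tok(prompt, return_tensors="pt"))
```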
With the rapid development of artificial intelligence, intelligent decision-making techniques have gradually surpassed human levels in various human-machine competitions, especially in complex multi-agent cooperative task scenarios. Multi-agent cooperative decision-making involves multiple agents working together to complete established tasks and achieve specific objectives. These techniques are widely applicable in real-world scenarios such as autonomous driving, drone navigation, disaster rescue, and simulated military confrontations. This paper begins with a comprehensive survey of the leading simulation environments and platforms used for multi-agent cooperative decision-making. Specifically, we provide an in-depth analysis of these simulation environments from various perspectives, including task formats, reward allocation, and the underlying technologies employed. Subsequently, we provide a comprehensive overview of the mainstream intelligent decision-making approaches, algorithms, and models for multi-agent systems (MAS). These approaches can be broadly categorized into five types: rule-based (primarily fuzzy logic), game theory-based, evolutionary algorithms-based, deep multi-agent reinforcement learning (MARL)-based, and large language model (LLM) reasoning-based. Given the significant advantages of MARL- and LLM-based decision-making methods over traditional rule-based, game-theoretic, and evolutionary approaches, this paper focuses on multi-agent methods utilizing MARL and LLM-based techniques. We provide an in-depth discussion of these approaches, highlighting their methodological taxonomies, advantages, and drawbacks. Furthermore, several prominent future research directions and potential challenges of multi-agent cooperative decision-making are detailed.
https://arxiv.org/abs/2503.13415
In this paper, we propose a new solution to reward adaptation (RA), the problem where the learning agent adapts to a target reward function based on one or multiple existing behaviors learned a priori under the same domain dynamics but different reward functions. Learning the target behavior from scratch is possible but often inefficient given the available source behaviors. Our work represents a new approach to RA via the manipulation of Q-functions. Assuming that the target reward function is a known function of the source reward functions, our approach to RA computes bounds on the Q-function. We introduce an iterative process to tighten the bounds, similar to value iteration. This enables action pruning in the target domain before learning even starts. We refer to such a method as Q-Manipulation (Q-M). We formally prove that our pruning strategy does not affect the optimality of the returned policy while empirically showing that it improves the sample complexity. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
https://arxiv.org/abs/2503.13414
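In tabular form, the bound-tightening and pruning idea might look like the sketch below. Deriving the reward bounds `R_lb`/`R_ub` from the source Q-functions is where the paper's specific construction lives; this sketch takes them as given:

```python
import numpy as np

def q_m_prune(R_lb, R_ub, P, gamma: float = 0.95, iters: int = 200):
    """R_lb/R_ub: (S, A) reward bounds; P: (S, A, S) transition probabilities."""
    S, A = R_lb.shape
    Q_lb = np.full((S, A), R_lb.min() / (1 - gamma))   # valid initial lower bound
    Q_ub = np.full((S, A), R_ub.max() / (1 - gamma))   # valid initial upper bound
    for _ in range(iters):                 # value-iteration-like tightening
        Q_lb = R_lb + gamma * (P @ Q_lb.max(axis=1))
        Q_ub = R_ub + gamma * (P @ Q_ub.max(axis=1))
    best_lb = Q_lb.max(axis=1, keepdims=True)
    return Q_ub >= best_lb   # per-state mask: actions failing this can be pruned
```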
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based `RefineBot' updates them to remove shortcuts. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53\%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at this https URL, and project page at this https URL.
https://arxiv.org/abs/2503.13399
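The two-stage pipeline can be sketched as a structure-then-derail loop; the prompts, loop bound, and `is_correct` helper are illustrative assumptions, not the released pipeline:

```python
def make_mcq(call_llm, question: str, answer: str, max_refines: int = 5) -> str:
    """Stage 1: structure into an MCQ; stage 2: remove language shortcuts."""
    mcq = call_llm(f"Turn this into a 4-option MCQ.\nQ: {question}\nA: {answer}")
    for _ in range(max_refines):
        blind_guess = call_llm(f"Answer WITHOUT seeing any image:\n{mcq}")
        if not is_correct(blind_guess, answer):   # is_correct: assumed matcher
            return mcq                            # text alone no longer suffices
        mcq = call_llm(f"Rewrite so the question cannot be solved from "
                       f"language cues alone:\n{mcq}")
    return mcq
```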
As requirements drift with rapid iterations, agile development becomes the dominant paradigm. Goal-driven Requirements Elicitation (RE) is a pivotal yet challenging task in agile project development due to its heavy entanglement with adaptive planning and efficient collaboration. Recently, AI agents have shown promising ability in supporting requirements analysis by saving significant time and effort for stakeholders. However, current research mainly focuses on functional RE, and no reported work bridges the long journey from goals to user stories. Moreover, considering the cost of LLM facilities and the need for data and idea protection, privately hosted small-sized LLMs should be further utilized in RE. To address these challenges, we propose Goal2Story, a multi-agent fleet that adopts the Impact Mapping (IM) framework while merely using cost-effective sLLMs for goal-driven RE. Moreover, we introduce the StorySeek dataset, which contains over 1,000 user stories (USs) with corresponding goals and project context information, along with a semi-automatic dataset construction method. For evaluation, we propose two metrics: Factuality Hit Rate (FHR), measuring consistency between the generated USs and the dataset, and Quality And Consistency Evaluation (QuACE), evaluating the quality of the generated USs. Experimental results demonstrate that Goal2Story outperforms the baseline performance of a Super-Agent adopting powerful LLMs, while also showcasing the performance improvements in key metrics that CoT and Agent Profile bring to Goal2Story, as well as its exploration in identifying latent needs.
https://arxiv.org/abs/2503.13279
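A hit-rate style metric in the spirit of FHR might be computed as below; the similarity measure and threshold are assumptions for illustration, not the paper's definition:

```python
from difflib import SequenceMatcher

def factuality_hit_rate(generated: list[str], references: list[str],
                        threshold: float = 0.6) -> float:
    """Fraction of generated user stories matching some reference story."""
    def hit(story: str) -> bool:
        return any(SequenceMatcher(None, story, ref).ratio() >= threshold
                   for ref in references)
    return sum(hit(s) for s in generated) / max(len(generated), 1)
```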