Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.
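As a concrete point of reference for the mechanism VRS verbalizes, here is a minimal numerical sketch of classical rejection sampling for a Bernoulli target; the Bernoulli(q) proposal, the helper name, and the acceptance rule are standard textbook choices assumed for illustration, not the paper's prompt protocol.

```python
import random

def rejection_sample_bernoulli(p: float, q: float = 0.5, rng=random) -> int:
    """Draw one sample from Bernoulli(p) using proposals from Bernoulli(q).

    Classical rejection sampling: accept a proposed x with probability
    target(x) / (M * proposal(x)), where M bounds the target/proposal ratio.
    VRS expresses this accept/reject decision in natural language instead.
    """
    target = {1: p, 0: 1.0 - p}
    proposal = {1: q, 0: 1.0 - q}
    M = max(target[1] / proposal[1], target[0] / proposal[0])
    while True:
        x = 1 if rng.random() < q else 0                      # propose
        if rng.random() < target[x] / (M * proposal[x]):       # accept or reject
            return x

# Sanity check: the acceptance rule corrects the bias of the proposal.
samples = [rejection_sample_bernoulli(0.3) for _ in range(10_000)]
print(sum(samples) / len(samples))  # ~0.3
```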
https://arxiv.org/abs/2506.09998
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments; it requires transferring knowledge from environments where empirical data is more readily available. Against this backdrop, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when knowledge transfer is required? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably learns an $\epsilon$-optimal policy with a tight sample complexity of $O(1/\epsilon^2)$.
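For reference, the stated guarantee can be written out explicitly; the value-function notation below is a standard convention assumed here, not notation taken from the paper. Writing $\hat{\pi}$ for the learned policy, $\pi^*$ for the optimal policy, and $V^{\pi}$ for the expected return of policy $\pi$, the claim is that

$$V^{\hat{\pi}} \;\ge\; V^{\pi^*} - \epsilon \qquad \text{after collecting } N = O\!\left(1/\epsilon^{2}\right) \text{ samples.}$$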
https://arxiv.org/abs/2506.09940
Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization--adapting to individual user preferences while completing tasks--remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.
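To make the two-agent evaluation loop concrete, a rough sketch is given below; `query_llm` is a hypothetical stand-in for any chat-completion backend, and the prompts and rubric fields are illustrative, not the benchmark's actual templates.

```python
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion backend here")

def run_dialogue(profile: dict, task: str, assistant, max_turns: int = 6) -> list:
    """User agent (an LLM role-playing `profile`) converses with the assistant."""
    history = []
    for _ in range(max_turns):
        user_turn = query_llm(
            f"You are a user with profile {profile}. Task: {task}.\n"
            f"Dialogue so far: {history}\nWrite your next message."
        )
        history.append(("user", user_turn))
        history.append(("assistant", assistant(history)))
    return history

def judge(profile: dict, task: str, history: list) -> str:
    """Judge agent (LLM-as-a-Judge) scores personalization, quality, and task success."""
    return query_llm(
        f"Rate this dialogue for (1) personalization w.r.t. {profile}, "
        f"(2) response quality, and (3) success on task '{task}'.\n{history}\n"
        "Return three 1-5 scores with brief justifications."
    )
```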
https://arxiv.org/abs/2506.09902
Embodied navigation stands as a foundational pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav, and VLN, which differ in task objectives and modalities, so datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents that can follow free-form instructions combining arbitrary modalities and capabilities. To achieve this, we propose a large-scale benchmark and a corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We carefully craft instruction-trajectory pairs, where instructions are diverse free-form texts spanning arbitrary modalities and capabilities. We also construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to capture the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it into a VLA-type model that produces low-level actions solely from 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) consisting of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL. Each stage contains specifically designed learning policies and rewards. Importantly, the TBA-SFT and Nav-GRPO designs are inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking before answering. We therefore investigate how to achieve thinking-before-action in the embodied navigation field, improving the model's reasoning ability toward generalist agents. Specifically, we propose TBA-SFT, which uses the TBA-CoT dataset to fine-tune the model as a cold-start phase, and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.
https://arxiv.org/abs/2506.09839
Research and practice in Intelligent Design (ID) have significantly enhanced engineering innovation, efficiency, quality, and productivity over recent decades, fundamentally reshaping how engineering designers think, behave, and interact with design processes. The recent emergence of Foundation Models (FMs), particularly Large Language Models (LLMs), has demonstrated general knowledge-based reasoning capabilities and opened new avenues for further transformation in engineering design. In this context, this paper introduces Intelligent Design 4.0 (ID 4.0) as an emerging paradigm empowered by agentic AI systems. We review the historical evolution of ID across four distinct stages: rule-based expert systems, task-specific machine learning models, large-scale foundation AI models, and the recent emerging paradigm of multi-agent collaboration. We propose a conceptual framework for ID 4.0 and discuss its potential to support end-to-end automation of engineering design processes through coordinated, autonomous multi-agent-based systems. Furthermore, we discuss future perspectives to enhance and fully realize ID 4.0's potential, including more complex design scenarios, more practical design implementations, novel agent coordination mechanisms, and autonomous design goal-setting with better human value alignment. In sum, these insights lay a foundation for advancing Intelligent Design toward greater adaptivity, autonomy, and effectiveness in addressing increasingly complex design challenges.
https://arxiv.org/abs/2506.09755
Absolute localization, aiming to determine an agent's location with respect to a global reference, is crucial for unmanned aerial vehicles (UAVs) in various applications, but it becomes challenging when global navigation satellite system (GNSS) signals are unavailable. Vision-based absolute localization methods, which locate the current view of the UAV in a reference satellite map to estimate its position, have become popular in GNSS-denied scenarios. However, existing methods mostly rely on traditional and low-level image matching, suffering from difficulties due to significant differences introduced by cross-source discrepancies and temporal variations. To overcome these limitations, in this paper, we introduce a hierarchical cross-source image matching method designed for UAV absolute localization, which integrates a semantic-aware and structure-constrained coarse matching module with a lightweight fine-grained matching module. Specifically, in the coarse matching module, semantic features derived from a vision foundation model first establish region-level correspondences under semantic and structural constraints. Then, the fine-grained matching module is applied to extract fine features and establish pixel-level correspondences. Building upon this, a UAV absolute visual localization pipeline is constructed without any reliance on relative localization techniques, mainly by employing an image retrieval module before the proposed hierarchical image matching modules. Experimental evaluations on public benchmark datasets and a newly introduced CS-UAV dataset demonstrate superior accuracy and robustness of the proposed method under various challenging conditions, confirming its effectiveness.
https://arxiv.org/abs/2506.09748
Monitoring Machine Learning (ML) models in production environments is crucial, yet traditional approaches often yield verbose, low-interpretability outputs that hinder effective decision-making. We propose a cognitive architecture for ML monitoring that applies feature engineering principles to agents based on Large Language Models (LLMs), significantly enhancing the interpretability of monitoring outputs. Central to our approach is a Decision Procedure module that simulates feature engineering through three key steps: Refactor, Break Down, and Compile. The Refactor step improves data representation to better capture feature semantics, allowing the LLM to focus on salient aspects of the monitoring data while reducing noise and irrelevant information. Break Down decomposes complex information for detailed analysis, and Compile integrates sub-insights into clear, interpretable outputs. This process leads to a more deterministic planning approach, reducing dependence on LLM-generated planning, which can sometimes be inconsistent and overly general. The combination of feature engineering-driven planning and selective LLM utilization results in a robust decision support system, capable of providing highly interpretable and actionable insights. Experiments using multiple LLMs demonstrate the efficacy of our approach, achieving significantly higher accuracy compared to various baselines across several domains.
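A minimal sketch of the Refactor, Break Down, and Compile steps is given below; `ask_llm` is a hypothetical stand-in for an LLM backend, and the prompts are placeholders rather than the paper's templates.

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM backend")

def monitor_report(raw_metrics: dict) -> str:
    # 1) Refactor: re-represent raw monitoring data so feature semantics are explicit,
    #    reducing the noise the LLM would otherwise have to wade through.
    refactored = {
        name: {"value": value, "meaning": f"what {name} measures and its healthy range"}
        for name, value in raw_metrics.items()
    }
    # 2) Break Down: analyze each feature (or feature group) in isolation.
    sub_insights = [
        ask_llm(f"Analyze this monitoring feature and flag anomalies: {item}")
        for item in refactored.items()
    ]
    # 3) Compile: merge the sub-insights into one interpretable, actionable report.
    return ask_llm(
        "Combine the following findings into a short monitoring report "
        f"with clear recommendations:\n{sub_insights}"
    )
```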
https://arxiv.org/abs/2506.09742
Obtaining multiple meaningfully diverse, high quality samples from Large Language Models for a fixed prompt remains an open challenge. Current methods for increasing diversity often only operate at the token-level, paraphrasing the same response. This is problematic because it leads to poor exploration on reasoning problems and to unengaging, repetitive conversational agents. To address this we propose Intent Factored Generation (IFG), factorising the sampling process into two stages. First, we sample a semantically dense intent, e.g., a summary or keywords. Second, we sample the final response conditioning on both the original prompt and the intent from the first stage. This allows us to use a higher temperature during the intent step to promote conceptual diversity, and a lower temperature during the final generation to ensure the outputs are coherent and self-consistent. Additionally, we find that prompting the model to explicitly state its intent for each step of the chain-of-thought before generating the step is beneficial for reasoning tasks. We demonstrate our method's effectiveness across a diverse set of tasks. We show this method improves both pass@k and Reinforcement Learning from Verifier Feedback on maths and code tasks. For instruction-tuning, we combine IFG with Direct Preference Optimisation to increase conversational diversity without sacrificing reward. Finally, we achieve higher diversity while maintaining the quality of generations on a general language modelling task, using a new dataset of reader comments and news articles that we collect and open-source. In summary, we present a simple method of increasing the sample diversity of LLMs while maintaining performance. This method can be implemented by changing the prompt and varying the temperature during generation, making it easy to integrate into many algorithms for gains across various applications.
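The two-stage sampling scheme can be sketched in a few lines; `generate` is a hypothetical sampler over any LLM backend, and the temperature values are illustrative defaults rather than the paper's tuned settings.

```python
def generate(prompt: str, temperature: float) -> str:
    raise NotImplementedError("plug in any LLM sampling backend")

def ifg_sample(prompt: str, t_intent: float = 1.2, t_final: float = 0.6) -> str:
    # Stage 1: sample a semantically dense intent (keywords / a one-line plan)
    # at high temperature to promote conceptual diversity across samples.
    intent = generate(
        f"{prompt}\nFirst, state in a few keywords the approach you will take.",
        temperature=t_intent,
    )
    # Stage 2: sample the final response conditioned on prompt + intent
    # at low temperature so the output stays coherent and self-consistent.
    return generate(
        f"{prompt}\nFollow this intent when answering: {intent}",
        temperature=t_final,
    )

# Drawing several samples for the same prompt then yields responses that differ
# in approach (via the intent), not merely in token-level paraphrase.
```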
https://arxiv.org/abs/2506.09659
The ongoing evolution of AI paradigms has propelled AI research into the Agentic AI stage. Consequently, the focus of research has shifted from single agents and simple applications towards multi-agent autonomous decision-making and task collaboration in complex environments. As Large Language Models (LLMs) advance, their applications become more diverse and complex, leading to increasingly situational and systemic risks. This has brought significant attention to value alignment for AI agents, which aims to ensure that an agent's goals, preferences, and behaviors align with human values and societal norms. This paper reviews value alignment in agent systems within specific application scenarios. It integrates the advancements in AI driven by large models with the demands of social governance. Our review covers value principles, agent system application scenarios, and agent value alignment evaluation. Specifically, value principles are organized hierarchically from a top-down perspective, encompassing macro, meso, and micro levels. Agent system application scenarios are categorized and reviewed from a general-to-specific viewpoint. Agent value alignment evaluation systematically examines datasets for value alignment assessment and relevant value alignment methods. Additionally, we delve into value coordination among multiple agents within agent systems. Finally, we propose several potential research directions in this field.
https://arxiv.org/abs/2506.09656
Diplomacy is a complex multiplayer game that requires both cooperation and competition, posing significant challenges for AI systems. Traditional methods rely on equilibrium search to generate extensive game data for training, which demands substantial computational resources. Large Language Models (LLMs) offer a promising alternative, leveraging pre-trained knowledge to achieve strong performance with relatively small-scale fine-tuning. However, applying LLMs to Diplomacy remains challenging due to the exponential growth of possible action combinations and the intricate strategic interactions among players. To address this challenge, we propose DipLLM, a fine-tuned LLM-based agent that learns equilibrium policies for Diplomacy. DipLLM employs an autoregressive factorization framework to simplify the complex task of multi-unit action assignment into a sequence of unit-level decisions. By defining an equilibrium policy within this framework as the learning objective, we fine-tune the model using only 1.5% of the data required by the state-of-the-art Cicero model, surpassing its performance. Our results demonstrate the potential of fine-tuned LLMs for tackling complex strategic decision-making in multiplayer games.
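The autoregressive factorization at the heart of DipLLM can be written compactly; the notation below ($s$ for the board state, $a_i$ for the order assigned to unit $i$ of $N$) is inferred from the description rather than copied from the paper:

$$\pi_\theta(a_1, a_2, \ldots, a_N \mid s) \;=\; \prod_{i=1}^{N} \pi_\theta\bigl(a_i \mid s,\, a_1, \ldots, a_{i-1}\bigr),$$

so the exponentially large joint action space is handled as a sequence of unit-level decisions.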
https://arxiv.org/abs/2506.09655
Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.
https://arxiv.org/abs/2506.09600
Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
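In outline, the verification-and-refinement loop can be sketched as follows; `verify_step` and `refine_step` stand in for the LLM-based verifier and Error Refiner agents and are assumptions for illustration, not the released pipeline.

```python
def verify_step(question: str, step: str) -> bool:
    raise NotImplementedError("LLM verifier: return True if the reasoning step is sound")

def refine_step(question: str, step: str) -> str:
    raise NotImplementedError("LLM Error Refiner: rewrite a flagged step")

def refine_reasoning_path(question: str, steps: list, max_rounds: int = 2) -> list:
    """Flag error-prone steps with the verifier and let the refiner correct them."""
    steps = list(steps)
    for _ in range(max_rounds):
        flagged = [i for i, s in enumerate(steps) if not verify_step(question, s)]
        if not flagged:
            break
        for i in flagged:
            steps[i] = refine_step(question, steps[i])
    return steps
```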
https://arxiv.org/abs/2506.09513
We study reinforcement learning from human feedback in general Markov decision processes, where agents learn from trajectory-level preference comparisons. A central challenge in this setting is to design algorithms that select informative preference queries to identify the underlying reward while ensuring theoretical guarantees. We propose a meta-algorithm based on randomized exploration, which avoids the computational challenges associated with optimistic approaches and remains tractable. We establish both regret and last-iterate guarantees under mild reinforcement learning oracle assumptions. To improve query complexity, we introduce and analyze an improved algorithm that collects batches of trajectory pairs and applies optimal experimental design to select informative comparison queries. The batch structure also enables parallelization of preference queries, which is relevant in practical deployment as feedback can be gathered concurrently. Empirical evaluation confirms that the proposed method is competitive with reward-based reinforcement learning while requiring a small number of preference queries.
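As one concrete (assumed) instantiation of the experimental-design step, the sketch below greedily selects trajectory pairs whose reward-feature differences maximize the log-determinant of the design matrix, i.e., D-optimality under a linear reward model; the feature map and greedy rule are illustrative choices, not necessarily the paper's exact criterion.

```python
import numpy as np

def select_queries(pair_features: np.ndarray, k: int, reg: float = 1e-3) -> list:
    """Pick k informative preference queries from a batch of candidate pairs.

    pair_features[i] = phi(traj_i_a) - phi(traj_i_b): feature difference of pair i
    under an assumed linear reward model r(traj) = <theta, phi(traj)>.
    """
    d = pair_features.shape[1]
    design = reg * np.eye(d)                 # regularized design matrix
    chosen = []
    for _ in range(k):
        best_i, best_gain = None, -np.inf
        for i, x in enumerate(pair_features):
            if i in chosen:
                continue
            gain = np.linalg.slogdet(design + np.outer(x, x))[1]  # D-optimality gain
            if gain > best_gain:
                best_i, best_gain = i, gain
        chosen.append(best_i)
        design += np.outer(pair_features[best_i], pair_features[best_i])
    return chosen
```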
https://arxiv.org/abs/2506.09508
We introduce Option Kernel Bellman Equations (OKBEs) for a new reward-free Markov Decision Process. Rather than a value function, OKBEs directly construct and optimize a predictive map called a state-time option kernel (STOK) to maximize the probability of completing a goal while avoiding constraint violations. STOKs are compositional, modular, and interpretable initiation-to-termination transition kernels for policies in the Options Framework of Reinforcement Learning. This means: 1) STOKs can be composed using Chapman-Kolmogorov equations to make spatiotemporal predictions for multiple policies over long horizons, 2) high-dimensional STOKs can be represented and computed efficiently in a factorized and reconfigurable form, and 3) STOKs record the probabilities of semantically interpretable goal-success and constraint-violation events, needed for formal verification. Given a high-dimensional state-transition model for an intractable planning problem, we can decompose it with local STOKs and goal-conditioned policies that are aggregated into a factorized goal kernel, making it possible to forward-plan at the level of goals in high-dimensions to solve the problem. These properties lead to highly flexible agents that can rapidly synthesize meta-policies, reuse planning representations across many tasks, and justify goals using empowerment, an intrinsic motivation function. We argue that reward-maximization is in conflict with the properties of compositionality, modularity, and interpretability. Alternatively, OKBEs facilitate these properties to support verifiable long-horizon planning and intrinsic motivation that scales to dynamic high-dimensional world-models.
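The compositionality property rests on the Chapman-Kolmogorov relation. Writing $K_{\pi}(s', t' \mid s, t)$ for the probability that option $\pi$, initiated in state $s$ at time $t$, terminates in state $s'$ at time $t'$ (notation assumed from the description), chaining two options reads

$$K_{\pi_1 \to \pi_2}(s'', t'' \mid s, t) \;=\; \sum_{s'} \sum_{t'=t}^{t''} K_{\pi_2}(s'', t'' \mid s', t')\, K_{\pi_1}(s', t' \mid s, t),$$

which is what allows spatiotemporal predictions to be rolled forward over long horizons at the level of goals rather than primitive actions.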
https://arxiv.org/abs/2506.09499
Scenario-based testing is essential for validating the performance of autonomous driving (AD) systems. However, such testing is limited by the scarcity of long-tailed, safety-critical scenarios in existing datasets collected in the real world. To tackle the data issue, we propose the Adv-BMT framework, which augments real-world scenarios with diverse and realistic adversarial interactions. The core component of Adv-BMT is a bidirectional motion transformer (BMT) model that performs inverse traffic motion prediction: it takes agent information at the last time step of the scenario as input and reconstructs the traffic in reverse chronological order back to the initial time step. The Adv-BMT framework is a two-stage pipeline: it first conducts adversarial initializations and then inverse motion predictions. Different from previous work, we do not need any collision data for pretraining, and we are able to generate realistic and diverse collision interactions. Our experimental results validate the quality of the collision scenarios generated by Adv-BMT: training on our augmented dataset reduces episode collision rates by 20% compared to previous work.
https://arxiv.org/abs/2506.09485
Multi-Object Tracking (MOT) plays a crucial role in autonomous driving systems, as it lays the foundations for advanced perception and precise path planning modules. Nonetheless, single-agent MOT has limited ability to sense its surroundings due to occlusions, sensor failures, etc. Hence, the integration of multi-agent information is essential for a comprehensive understanding of the environment. This paper proposes a novel cooperative MOT framework for tracking objects in 3D LiDAR scenes by formulating and solving a graph topology-aware optimization problem so as to fuse information coming from multiple vehicles. By exploiting a fully connected graph topology defined by the detected bounding boxes, we employ the Graph Laplacian processing optimization technique to smooth the position error of bounding boxes and effectively combine them. In that manner, we reveal and leverage inherent coherences of diverse multi-agent detections, and associate the refined bounding boxes to tracked objects in two stages, optimizing localization and tracking accuracies. An extensive evaluation study has been conducted using the real-world V2V4Real dataset, where the proposed method significantly outperforms the baseline frameworks, including the state-of-the-art deep-learning DMSTrack and V2V4Real, in various testing sequences.
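A generic rendering of the smoothing step is sketched below: pooled box centers from all cooperating vehicles form a fully connected graph, and Tikhonov-style Laplacian regularization pulls mutually consistent detections together. The Gaussian edge weights and the strength $\lambda$ are illustrative assumptions, not the paper's exact optimization problem.

```python
import numpy as np

def laplacian_smooth(centers: np.ndarray, sigma: float = 2.0, lam: float = 1.0) -> np.ndarray:
    """Smooth (N, 3) detected box centers pooled from multiple vehicles."""
    diff = centers[:, None, :] - centers[None, :, :]
    W = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))  # fully connected graph
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W                              # graph Laplacian
    # argmin_X ||X - centers||^2 + lam * tr(X^T L X)  =>  (I + lam * L) X = centers
    return np.linalg.solve(np.eye(len(centers)) + lam * L, centers)
```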
https://arxiv.org/abs/2506.09469
The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, our goal is to study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the $N$ agents' effort allocations on individual tasks to a task score, and an outer operator that merges the $M$ task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneous Environment Design (HED), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Experiments in matrix games and an embodied Multi-Goal-Capture environment show that, despite the difference in settings, HED rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HED and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.
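In symbols (notation assumed from the description), with $e_{ij}$ the effort agent $i$ devotes to task $j$, an inner operator $\phi$ and an outer operator $\psi$ assemble the team reward as

$$R \;=\; \psi\Bigl(\phi(e_{11},\ldots,e_{N1}),\;\ldots,\;\phi(e_{1M},\ldots,e_{NM})\Bigr),$$

and whether specializing (a heterogeneous effort allocation) can beat spreading effort evenly reduces, by a Jensen-type argument, to the curvature (convexity or concavity) of these composed maps.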
https://arxiv.org/abs/2506.09434
Recent improvements in large language models (LLMs) have led many researchers to focus on building fully autonomous AI agents. This position paper questions whether this approach is the right path forward, as these autonomous systems still have problems with reliability, transparency, and understanding the actual requirements of humans. We suggest a different approach: LLM-based Human-Agent Systems (LLM-HAS), where AI works with humans rather than replacing them. By keeping humans involved to provide guidance, answer questions, and maintain control, these systems can be more trustworthy and adaptable. Looking at examples from healthcare, finance, and software development, we show how human-AI teamwork can handle complex tasks better than AI working alone. We also discuss the challenges of building these collaborative systems and offer practical solutions. This paper argues that progress in AI should not be measured by how independent systems become, but by how well they can work with humans. The most promising future for AI is not in systems that take over human roles, but in those that enhance human capabilities through meaningful partnership.
https://arxiv.org/abs/2506.09420
This position paper proposes a fundamental shift in designing code generation models: treating reasoning depth as a controllable resource. Rather than being an incidental byproduct of prompting, we argue that the trade-off between rapid, direct answers ("fast thinking") and elaborate, chain-of-thought deliberation ("slow thinking") must be explicitly managed. We contend that optimizing reasoning budgets across the entire model lifecycle - from synthetic data creation and benchmarking to real-world deployment - can unlock superior trade-offs among accuracy, latency, and cost. This paper outlines how adaptive control over reasoning can enrich supervision signals, motivate new multi-dimensional benchmarks, and inform cost-aware, security-conscious deployment policies. By viewing fast and slow thinking as complementary modes to be scheduled, we envision coding agents that think deep when necessary and act fast when possible.
https://arxiv.org/abs/2506.09396
The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. In addition, it introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO's superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon, at this https URL.
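The two ingredients described above can be illustrated with a short sketch; the exact functional forms below (Shannon entropy over a zone's predicted interaction distribution, and an exponentially decaying distance reward) are assumptions for illustration, not LPO's published formulation.

```python
import numpy as np

def zone_entropy(pixel_probs: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy of the predicted interaction distribution inside a zone."""
    p = pixel_probs.ravel()
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def location_reward(pred_xy, target_xy, scale: float = 50.0) -> float:
    """Reward decays smoothly as the predicted click drifts from the target location."""
    d = np.linalg.norm(np.asarray(pred_xy, dtype=float) - np.asarray(target_xy, dtype=float))
    return float(np.exp(-d / scale))
```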
https://arxiv.org/abs/2506.09373