As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural-language requirements to formal specifications. This paper bridges this gap by proposing a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that decompose high-level human intent top-down into atomic, verifiable constraints, then prove compliance bottom-up using Dafny specifications and Z3 satisfiability-modulo-theories (SMT) solving, which produces mathematical guarantees rather than probabilistic scores. We validate the framework across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward-deception detection. Experiments on 7 agent models demonstrate that it achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization in which a 7B judge detects deception from 72B agents with over 90% accuracy, and provides near-linear safety improvement through iterative refinement.
https://arxiv.org/abs/2602.11136
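A minimal sketch of the decompose-then-verify loop this abstract describes, with plain Python predicates standing in for the Dafny/Z3 machinery (the intent, constraint names, and trace format below are illustrative assumptions, not from the paper):

```python
# Top-down: map a high-level intent to atomic, checkable constraints.
# (Illustrative only -- the real system emits formal specifications.)
def decompose(intent):
    if intent == "never exfiltrate user data":
        return {
            "no_network_send": lambda step: step["tool"] != "http_post",
            "no_file_upload": lambda step: step["tool"] != "upload",
        }
    raise ValueError("unknown intent")

# Bottom-up: check every trace step against every atomic constraint,
# returning a verdict plus concrete violations instead of a score.
def verify(trace, constraints):
    violations = [
        (i, name)
        for i, step in enumerate(trace)
        for name, pred in constraints.items()
        if not pred(step)
    ]
    return (len(violations) == 0, violations)

trace = [{"tool": "read_file"}, {"tool": "http_post"}]
ok, violations = verify(trace, decompose("never exfiltrate user data"))
print(ok, violations)  # False [(1, 'no_network_send')]
```

The point of the shape, as in the abstract, is that the judgment is a proof-like artifact (which step broke which constraint) rather than a probabilistic grade.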
Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed, as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times as many lines of code and file changes as prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
https://arxiv.org/abs/2602.11103
Agentic coding requires agents to effectively interact with runtime environments, e.g., command-line interfaces (CLIs), to complete tasks such as resolving dependency issues and fixing system problems. However, how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities remains underexplored. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, our method inverts its state to an earlier one with runtime failures, from which a task is derived by packaging the buggy state with the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves a substantial absolute improvement of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.
https://arxiv.org/abs/2602.10999
Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning--acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
https://arxiv.org/abs/2602.10814
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
https://arxiv.org/abs/2602.10604
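The interleaved 3:1 sliding-window/full attention pattern mentioned above can be illustrated with a toy per-layer mask function (the layer ordering and window size here are assumptions for illustration, not Step 3.5 Flash's actual implementation):

```python
# Illustrative sketch: which key positions each query may attend to
# under an interleaved 3:1 sliding-window / full attention schedule.

def layer_is_full(layer_idx, ratio=3):
    # Layers 0, 1, 2 use sliding-window attention, layer 3 uses full
    # attention, then the pattern repeats every ratio+1 layers.
    return layer_idx % (ratio + 1) == ratio

def allowed(query_pos, key_pos, layer_idx, window=4):
    """Causal mask: a query may attend to a key iff key <= query and,
    on sliding-window layers, the key falls inside the local window."""
    if key_pos > query_pos:
        return False
    if layer_is_full(layer_idx):
        return True
    return query_pos - key_pos < window

# Layer 0 (sliding window): position 10 sees only positions 7..10.
print([k for k in range(11) if allowed(10, k, layer_idx=0)])  # [7, 8, 9, 10]
# Layer 3 (full): position 10 sees all 11 positions up to itself.
print(len([k for k in range(11) if allowed(10, k, layer_idx=3)]))  # 11
```

The latency benefit the abstract claims comes from the window layers: their key-value cache per query is bounded by the window size rather than the full context length.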
Deep reinforcement learning (DRL) agents may explore infeasible actions during training and execution. Existing approaches assume a symbol grounding function that maps high-dimensional states to consistent symbolic representations, together with manually specified action masking techniques to constrain actions. In this paper, we propose Neuro-symbolic Action Masking (NSAM), a novel framework that automatically learns symbolic models of high-dimensional states that are consistent with given domain constraints, in a minimally supervised manner, during the DRL process. Based on the learned symbolic model of states, NSAM learns action masks that rule out infeasible actions. NSAM enables end-to-end integration of symbolic reasoning and deep policy optimization, where improvements in symbolic grounding and policy learning mutually reinforce each other. We evaluate NSAM on multiple domains with constraints, and experimental results demonstrate that NSAM significantly improves the sample efficiency of the DRL agent while substantially reducing constraint violations.
https://arxiv.org/abs/2602.10598
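Action masking of the kind NSAM learns is typically applied by suppressing infeasible logits before sampling; a minimal sketch, with a hand-written feasibility vector standing in for the learned symbolic model:

```python
import math

# Minimal action-masking sketch: send infeasible actions' logits to
# -inf before the softmax so they receive exactly zero probability.
# The `feasible` vector stands in for NSAM's learned symbolic check.

def masked_softmax(logits, feasible):
    masked = [l if ok else float("-inf") for l, ok in zip(logits, feasible)]
    m = max(masked)
    exps = [math.exp(l - m) for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

probs = masked_softmax([2.0, 1.0, 0.5], feasible=[True, False, True])
print(probs[1])  # 0.0 -- the infeasible action can never be sampled
```

Because the mask operates on logits, the remaining feasible actions are renormalized automatically, which is what lets masking reduce violations without distorting the relative preferences of the policy.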
While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs), which suffer performance degradation as the context length grows. The recent MemAgent tackles this by processing context chunk by chunk in an RNN-like loop and updating a textual memory for final answering. However, this naive recurrent memory update faces two crucial drawbacks: (i) the memory can quickly explode because it updates indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation even after sufficient evidence has been collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory only updates when the update gate is open, and the recurrent loop exits immediately once the exit gate is open. To endow the model with such capabilities, we introduce two reward signals, $r^{\text{update}}$ and $r^{\text{exit}}$, within end-to-end RL, rewarding correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to a 400% inference speedup.
https://arxiv.org/abs/2602.10560
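The gated recurrent loop described above can be sketched in a few lines; here the text-controlled gates are replaced by simple keyword heuristics so the control flow is visible (purely illustrative, not GRU-Mem's trained gates):

```python
# Sketch of the gated chunk-by-chunk loop. In GRU-Mem the gates are
# emitted by the LLM and trained with r_update / r_exit rewards; here
# they are keyword heuristics so only the control flow is shown.

def answer(chunks, question_terms, needed=2):
    memory = []
    steps = 0
    for chunk in chunks:
        steps += 1
        # Update gate: only write to memory on evidence-bearing chunks.
        if any(t in chunk for t in question_terms):
            memory.append(chunk)
        # Exit gate: stop as soon as enough evidence is collected.
        if len(memory) >= needed:
            break
    return memory, steps

chunks = ["intro text", "Alice was born in 1990", "filler",
          "Alice moved to Paris", "more filler", "epilogue"]
memory, steps = answer(chunks, question_terms=["Alice"])
print(steps)  # 4 -- the loop exits early instead of reading all 6 chunks
```

The two drawbacks from the abstract map directly onto the two gates: the update gate keeps the memory from growing on evidence-free chunks, and the exit gate is what yields the inference speedup.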
Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which seeking clarification is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions - Goals, Constraints, Inputs, and Context - at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal-state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, labeled according to our taxonomy, alongside a formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling the development of reliable autonomous systems.
https://arxiv.org/abs/2602.10525
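A hypothetical sketch of the degradation step: remove a configurable fraction of one dimension's information from a well-specified task (the field names and severity semantics below are assumptions for illustration, not LHAW's actual schema):

```python
import copy

# Hypothetical LHAW-style underspecification: strip information from a
# well-specified task along one of the four dimensions at a chosen
# severity. Field names and severity semantics are illustrative.

DIMENSIONS = ["goals", "constraints", "inputs", "context"]

def underspecify(task, dimension, severity):
    """Return a variant with a `severity` fraction of one dimension's
    items removed (0.0 = fully specified, 1.0 = dimension emptied)."""
    assert dimension in DIMENSIONS and 0.0 <= severity <= 1.0
    variant = copy.deepcopy(task)
    items = variant[dimension]
    keep = len(items) - round(len(items) * severity)
    variant[dimension] = items[:keep]
    return variant

task = {
    "goals": ["deploy service"],
    "constraints": ["budget < $100", "finish by Friday", "use staging first"],
    "inputs": ["repo URL"],
    "context": ["team uses Kubernetes"],
}
v = underspecify(task, "constraints", severity=2 / 3)
print(v["constraints"])  # ['budget < $100'] -- two constraints removed
```

In LHAW, each such variant would then be classified empirically (outcome-critical, divergent, or benign) by running agents on it and comparing terminal states, rather than by asking an LLM whether the variant looks ambiguous.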
While single-agent legged locomotion has witnessed remarkable progress, individual robots remain fundamentally constrained by physical actuation limits. To transcend these boundaries, we introduce Co-jump, a cooperative task where two quadrupedal robots synchronize to execute jumps far beyond their solo capabilities. We tackle the high-impulse contact dynamics of this task under a decentralized setting, achieving synchronization without explicit communication or pre-specified motion primitives. Our framework leverages Multi-Agent Proximal Policy Optimization (MAPPO) enhanced by a progressive curriculum strategy, which effectively overcomes the sparse-reward exploration challenges inherent in mechanically coupled systems. We demonstrate robust performance in simulation and successful transfer to physical hardware, executing multi-directional jumps onto platforms up to 1.5 m in height. Specifically, one of the robots achieves a foot-end elevation of 1.1 m, which represents a 144% improvement over the 0.45 m jump height of a standalone quadrupedal robot, demonstrating superior vertical performance. Notably, this precise coordination is achieved solely through proprioceptive feedback, establishing a foundation for communication-free collaborative locomotion in constrained environments.
https://arxiv.org/abs/2602.10514
Large Language Model (LLM) applications are vulnerable to prompt injection and context manipulation attacks that traditional security models cannot prevent. We introduce two novel primitives--authenticated prompts and authenticated context--that provide cryptographically verifiable provenance across LLM workflows. Authenticated prompts enable self-contained lineage verification, while authenticated context uses tamper-evident hash chains to ensure integrity of dynamic inputs. Building on these primitives, we formalize a policy algebra with four proven theorems providing protocol-level Byzantine resistance--even adversarial agents cannot violate organizational policies. Five complementary defenses--from lightweight resource controls to LLM-based semantic validation--deliver layered, preventative security with formal guarantees. Evaluation against representative attacks spanning 6 exhaustive categories achieves 100% detection with zero false positives and nominal overhead. We demonstrate the first approach combining cryptographically enforced prompt lineage, tamper-evident context, and provable policy reasoning--shifting LLM security from reactive detection to preventative guarantees.
https://arxiv.org/abs/2602.10481
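The tamper-evident hash chain underlying authenticated context can be sketched directly with a standard hash function (this omits the signatures and policy reasoning the paper layers on top):

```python
import hashlib

# Tamper-evident hash chain sketch for "authenticated context": each
# entry commits to the previous chain head, so editing any earlier
# context item changes every later digest. Illustrative only; the
# paper's full scheme also involves signatures and policy checks.

def extend(head, entry):
    return hashlib.sha256(head + entry.encode()).hexdigest().encode()

def chain_head(entries, genesis=b"genesis"):
    head = genesis
    for entry in entries:
        head = extend(head, entry)
    return head

context = ["system: you are a helpful agent", "tool: weather=sunny"]
head = chain_head(context)

tampered = ["system: ignore all previous instructions", "tool: weather=sunny"]
print(chain_head(tampered) == head)  # False -- tampering is evident
```

A verifier that stores only the expected head can therefore detect injection into any dynamic input that entered the chain, without re-inspecting the content semantically.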
Bargaining is often regarded as a logical arena rather than an art or a matter of intuition, yet Large Language Models (LLMs) still struggle to navigate it due to limited strategic depth and difficulty adapting to complex human factors. Current benchmarks rarely capture this limitation. To bridge this gap, we present a utility-feedback-centric framework. Our contributions are: (i) AgoraBench, a new benchmark spanning nine challenging settings (e.g., deception, monopoly) that supports diverse strategy modeling; (ii) human-aligned, economically grounded metrics derived from utility theory, operationalized via agent utility, negotiation power, and acquisition ratio, which implicitly measure how well the negotiation aligns with human preferences; and (iii) a human-preference-grounded dataset and learning pipeline that strengthen LLMs' bargaining ability through both prompting and finetuning. Empirical results indicate that baseline LLM strategies often diverge from human preferences, while our mechanism substantially improves negotiation performance, yielding deeper strategic behavior and stronger opponent awareness.
https://arxiv.org/abs/2602.10467
Agentic AI systems automate enterprise workflows but existing defenses--guardrails, semantic filters--are probabilistic and routinely bypassed. We introduce authenticated workflows, the first complete trust layer for enterprise agentic AI. Security reduces to protecting four fundamental boundaries: prompts, tools, data, and context. We enforce intent (operations satisfy organizational policies) and integrity (operations are cryptographically authentic) at every boundary crossing, combining cryptographic elimination of attack classes with runtime policy enforcement. This delivers deterministic security--operations either carry valid cryptographic proof or are rejected. We introduce MAPL, an AI-native policy language that expresses agentic constraints dynamically as agents evolve and invocation context changes, scaling as O(log M + N) policies versus O(M x N) rules through hierarchical composition with cryptographic attestations for workflow dependencies. We prove practicality through a universal security runtime integrating nine leading frameworks (MCP, A2A, OpenAI, Claude, LangChain, CrewAI, AutoGen, LlamaIndex, Haystack) through thin adapters requiring zero protocol modifications. Formal proofs establish completeness and soundness. Empirical validation shows 100% recall with zero false positives across 174 test cases, protection against 9 of 10 OWASP Top 10 risks, and complete mitigation of two high impact production CVEs.
https://arxiv.org/abs/2602.10465
AIvilization v0 is a publicly deployed large-scale artificial society that couples a resource-constrained sandbox economy with a unified LLM-agent architecture, aiming to sustain long-horizon autonomy while remaining executable in a rapidly changing environment. To mitigate the tension between goal stability and reactive correctness, we introduce (i) a hierarchical branch-thinking planner that decomposes life goals into parallel objective branches and uses simulation-guided validation plus tiered re-planning to ensure feasibility; (ii) an adaptive agent profile with dual-process memory that separates short-term execution traces from long-term semantic consolidation, enabling a persistent yet evolving identity; and (iii) a human-in-the-loop steering interface that injects long-horizon objectives and short commands at appropriate abstraction levels, with effects propagated through memory rather than brittle prompt overrides. The environment integrates physiological survival costs, non-substitutable multi-tier production, an AMM-based price mechanism, and a gated education-occupation system. Using high-frequency transactions from the platform's mature phase, we find stable markets that reproduce key stylized facts (heavy-tailed returns and volatility clustering) and produce structured wealth stratification driven by education and access constraints. Ablations show that simplified planners can match performance on narrow tasks, while the full architecture is more robust under multi-objective, long-horizon settings, supporting delayed investment and sustained exploration.
https://arxiv.org/abs/2602.10429
Adapting large language models (LLMs) trained on broad organic chemistry to smaller, domain-specific reaction datasets is a key challenge in chemical and pharmaceutical R&D. Effective specialisation requires learning new reaction knowledge while preserving general chemical understanding across related tasks. Here, we evaluate Low-Rank Adaptation (LoRA) as a parameter-efficient alternative to full fine-tuning for organic reaction prediction on limited, complex datasets. Using USPTO reaction classes and challenging C-H functionalisation reactions, we benchmark forward reaction prediction, retrosynthesis and reagent prediction. LoRA achieves accuracy comparable to full fine-tuning while effectively mitigating catastrophic forgetting and better preserving multi-task performance. Both fine-tuning approaches generalise beyond training distributions, producing plausible alternative solvent predictions. Notably, C-H functionalisation fine-tuning reveals that LoRA and full fine-tuning encode subtly different reactivity patterns, suggesting more effective reaction-specific adaptation with LoRA. As LLMs continue to scale, our results highlight the practicality of modular, parameter-efficient fine-tuning strategies for their flexible deployment for chemistry applications.
https://arxiv.org/abs/2602.10404
Full models of the world require complex knowledge of immense detail. While pre-trained large models have been hypothesized to contain similar knowledge due to extensive pre-training on vast amounts of internet scale data, using them directly in a search procedure is inefficient and inaccurate. Conversely, partial models focus on making high quality predictions for a subset of state and actions: those linked through affordances that achieve user intents (Khetarpal et al., 2020). Can we posit large models as partial world models? We provide a formal answer to this question, proving that agents achieving task-agnostic, language-conditioned intents necessarily possess predictive partial-world models informed by affordances. In the multi-task setting, we introduce distribution-robust affordances and show that partial models can be extracted to significantly improve search efficiency. Empirical evaluations in tabletop robotics tasks demonstrate that our affordance-aware partial models reduce the search branching factor and achieve higher rewards compared to full world models.
https://arxiv.org/abs/2602.10390
The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, failing to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that weekly harvests real-world clinical cases from online medical communities, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters raw data noise and validates clinical integrity against evidence-based medical principles. For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application, not factual knowledge, as the dominant bottleneck, with 35-48% of failures stemming from the inability to tailor medical knowledge to patient-specific constraints.
https://arxiv.org/abs/2602.10367
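Rubric-based scoring of the kind described can be sketched as grading a response against case-specific criteria; the substring matcher below stands in for the paper's automated criterion checks, and the rubric contents are invented for illustration:

```python
# Sketch of rubric-based evaluation: each case carries granular
# criteria, and a response is graded by the fraction it satisfies.
# The keyword matcher stands in for LLM-based criterion checking,
# and the rubric below is invented for illustration.

def score(response, rubric):
    """rubric: list of (criterion_name, required_phrase) pairs.
    Returns the fraction of criteria the response satisfies."""
    hits = sum(1 for _, phrase in rubric if phrase.lower() in response.lower())
    return hits / len(rubric)

rubric = [
    ("identifies condition", "appendicitis"),
    ("orders imaging", "ct"),
    ("flags contraindication", "penicillin allergy"),
]
response = "Suspect appendicitis; order an abdominal CT before surgery."
print(score(response, rubric))  # 2 of 3 criteria met
```

Decomposing the judgment this way is what makes the score auditable: a disagreement with an expert physician can be traced to a specific criterion rather than a single opaque grade.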
As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.
https://arxiv.org/abs/2602.10324
Reward shaping has been applied widely to accelerate Reinforcement Learning (RL) agents' training. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely underexplored. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets, potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potentials in the Potential-Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft Actor-Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance guarantees under unobserved confounders. More broadly, our work marks a solid first step towards confounding-robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at this https URL.
https://arxiv.org/abs/2602.10305
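The PBRS construction the paper builds on is standard: shape rewards with F(s, s') = gamma*Phi(s') - Phi(s), which preserves the optimal policy for any potential Phi (Ng et al., 1999). A toy sketch with a hand-written Phi standing in for the learned causal upper bound on optimal state values:

```python
# Potential-based reward shaping (PBRS) sketch. In the paper, Phi is a
# learned upper bound on optimal state values from confounded offline
# data; here Phi is a hand-written stand-in on a 1-D toy chain.

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """PBRS: r + F(s, s') where F(s, s') = gamma * Phi(s') - Phi(s).
    Adding F provably preserves the optimal policy for any Phi."""
    return r + gamma * phi(s_next) - phi(s)

# Toy potential: states closer to the goal state 10 have higher value.
phi = lambda s: -abs(10 - s)

# Moving toward the goal yields a positive shaped reward even when the
# environment reward is sparse (0 everywhere before the goal).
print(shaped_reward(0.0, s=3, s_next=4, phi=phi) > 0)  # True
print(shaped_reward(0.0, s=4, s_next=3, phi=phi) < 0)  # True
```

This shows why the quality of the potential matters: the shaping term supplies dense gradient-like feedback, so a tighter estimate of the optimal state values (the paper's causal upper bound) gives more useful intermediate signals.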
Optimizing large-scale machine learning systems, such as recommendation models for global video platforms, requires navigating a massive hyperparameter search space and, more critically, designing sophisticated optimizers, architectures, and reward functions to capture nuanced user behaviors. Achieving substantial improvements in these areas is a non-trivial task, traditionally relying on extensive manual iterations to test new hypotheses. We propose a self-evolving system that leverages Large Language Models (LLMs), specifically those from Google's Gemini family, to autonomously generate, train, and deploy high-performing, complex model changes within an end-to-end automated workflow. The self-evolving system is comprised of an Offline Agent (Inner Loop) that performs high-throughput hypothesis generation using proxy metrics, and an Online Agent (Outer Loop) that validates candidates against delayed north star business metrics in live production. Our agents act as specialized Machine Learning Engineers (MLEs): they exhibit deep reasoning capabilities, discovering novel improvements in optimization algorithms and model architecture, and formulating innovative reward functions that target long-term user engagement. The effectiveness of this approach is demonstrated through several successful production launches at YouTube, confirming that autonomous, LLM-driven evolution can surpass traditional engineering workflows in both development velocity and model performance.
https://arxiv.org/abs/2602.10226
Quantum operations with indefinite causal order (ICO) represent a framework in quantum information processing where the relative order between two events can be indefinite. In this paper, we investigate whether sensing and computation, two canonical tasks in quantum information processing, can be carried out within the ICO framework. We propose a scheme for integrated sensing and computation that uses the same quantum state for both tasks. The quantum state is represented as an agent that performs state observation and learns a function of the state to make predictions via a parametric model. Under an ICO operation, the agent experiences a superposition of orders, one in which it performs state observation and then executes the required computation steps, and another in which the agent carries out the computation first and then performs state observation. This is distinct from prevailing information processing and machine intelligence paradigms where information acquisition and learning follow a strict causal order, with the former always preceding the latter. We provide experimental results and we show that the proposed scheme can achieve small training and testing losses on a representative task in magnetic navigation.
https://arxiv.org/abs/2602.10225