Generating molecules that satisfy precise numeric constraints over multiple physicochemical properties is both critical and challenging. Although large language models (LLMs) are expressive, they struggle with precise multi-objective control and numeric reasoning without external structure and feedback. We introduce \textbf{MolGen}, a fragment-level, retrieval-augmented, two-stage framework for molecule generation under multi-property constraints. Stage I (prototype generation): a multi-agent reasoner performs retrieval-anchored, fragment-level edits to produce a candidate near the feasible region. Stage II (RL-based fine-grained optimization): a fragment-level optimizer trained with Group Relative Policy Optimization (GRPO) applies one- or multi-hop refinements that explicitly minimize property errors toward the target while regulating edit complexity and deviation from the prototype. A large, automatically curated dataset of reasoning chains over fragment edits and measured property deltas underpins both stages, enabling deterministic, reproducible supervision and controllable multi-hop reasoning. Unlike prior work, our framework reasons about molecules at the fragment level and supports controllable refinement toward numeric targets. Experiments on generation under two sets of property constraints (QED, LogP, and molecular weight; HOMO and LUMO) show consistent gains in validity and precise satisfaction of multi-property targets, outperforming strong LLMs and graph-based algorithms.
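The Stage II objective described above can be sketched as a reward that penalizes normalized property error, edit complexity, and deviation from the prototype, with advantages computed group-relatively as in GRPO. All names, weights, and penalty terms below are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of a GRPO-style reward for multi-property targets.
# Weights (edit_cost, dev_cost) and the tolerance normalization are
# illustrative assumptions, not MolGen's exact formulation.

def property_reward(props, targets, tol, n_edits, deviation,
                    edit_cost=0.05, dev_cost=0.1):
    """Negative tolerance-normalized property error, minus penalties
    for edit complexity and deviation from the prototype."""
    error = sum(abs(props[k] - targets[k]) / tol[k] for k in targets)
    return -error - edit_cost * n_edits - dev_cost * deviation

def grpo_advantages(rewards):
    """Group-relative advantages: center and scale rewards within a
    group of candidates sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for a flat group
    return [(r - mean) / std for r in rewards]
```

Group-relative normalization removes the need for a learned value baseline, which is one reason GRPO is attractive for this kind of numeric-target refinement.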
https://arxiv.org/abs/2601.10131
The past year has marked a turning point in the evolution and real-world use of large language models (LLMs). With the release of the first widely adopted reasoning model, o1, on December 5th, 2024, the field shifted from single-pass pattern generation to multi-step deliberative inference, accelerating deployment, experimentation, and new classes of applications. As this shift unfolded at a rapid pace, our empirical understanding of how these models are actually used in practice has lagged behind. In this work, we leverage the OpenRouter platform, an AI inference provider that serves a wide variety of LLMs, to analyze over 100 trillion tokens of real-world LLM interactions across tasks, geographies, and time. In our empirical study, we observe substantial adoption of open-weight models, the outsized popularity of the creative roleplay (beyond just the productivity tasks many assume dominate) and coding-assistance categories, and the rise of agentic inference. Furthermore, our retention analysis identifies foundational cohorts: early users whose engagement persists far longer than that of later cohorts. We term this phenomenon the Cinderella "Glass Slipper" effect. These findings underscore that the way developers and end-users engage with LLMs "in the wild" is complex and multifaceted. We discuss implications for model builders, AI developers, and infrastructure providers, and outline how a data-driven understanding of usage can inform better design and deployment of LLM systems.
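The cohort retention analysis described above can be sketched in a few lines: group users by their first-active period, then measure what fraction of each cohort remains active at each later offset. The event schema here is a minimal assumption, not the paper's data format:

```python
# Minimal cohort-retention sketch. `events` is an iterable of
# (user_id, period_index) pairs; field names are assumptions.
from collections import defaultdict

def retention_curves(events):
    """Return {cohort_period: [retention at offset 0, 1, ...]}."""
    first = {}                           # user -> first-active period
    active = defaultdict(set)            # period -> set of active users
    for user, period in sorted(events, key=lambda e: e[1]):
        first.setdefault(user, period)
        active[period].add(user)
    last = max(active)
    curves = {}
    for cohort in sorted(set(first.values())):
        members = {u for u, p in first.items() if p == cohort}
        curves[cohort] = [
            len(members & active.get(cohort + k, set())) / len(members)
            for k in range(last - cohort + 1)
        ]
    return curves
```

A "Glass Slipper" pattern would show up as the earliest cohort's curve decaying much more slowly than later cohorts' curves.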
https://arxiv.org/abs/2601.10088
Academic paper search is a fundamental task in scientific research, yet most existing approaches rely on rigid, predefined workflows that struggle with complex, conditional queries. To address this limitation, we propose PaperScout, an autonomous agent that reformulates paper search as a sequential decision-making process. Unlike static workflows, PaperScout dynamically decides whether, when, and how to invoke search and expand tools based on accumulated retrieval context. However, training such agents presents a fundamental challenge: standard reinforcement learning methods, typically designed for single-turn tasks, suffer from a granularity mismatch when applied to multi-turn agentic tasks, where token-level optimization diverges from the granularity of sequence-level interactions, leading to noisy credit assignment. We introduce Proximal Sequence Policy Optimization (PSPO), a process-aware, sequence-level policy optimization method that aligns optimization with agent-environment interaction. Comprehensive experiments on both synthetic and real-world benchmarks demonstrate that PaperScout significantly outperforms strong workflow-driven and RL baselines in both recall and relevance, validating the effectiveness of our adaptive agentic framework and optimization strategy.
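The granularity shift PSPO targets can be illustrated by computing the PPO-style clipped objective at the level of a whole interaction turn (summing token log-probs into one sequence-level importance ratio) rather than per token. This is a hedged sketch of the idea, not the paper's exact objective:

```python
# Sequence-level clipped policy objective: one importance ratio per
# turn, aligning optimization with agent-environment interaction.
# The exact PSPO objective may differ; this is an illustrative sketch.
import math

def seq_clipped_objective(new_logps, old_logps, advantage, eps=0.2):
    """new_logps / old_logps: per-token log-probs for one turn."""
    # Summing token log-probs gives one sequence-level ratio for the turn.
    ratio = math.exp(sum(new_logps) - sum(old_logps))
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    # Pessimistic (clipped) lower bound, as in PPO.
    return min(ratio * advantage, clipped * advantage)
```

Because credit is assigned per turn instead of per token, a noisy token inside an otherwise good turn no longer receives its own (mis)estimated update direction.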
https://arxiv.org/abs/2601.10029
Large Language Models (LLMs) are increasingly shaping human-computer interaction (HCI), from personalized assistants to social simulations. Beyond language competence, researchers are exploring whether LLMs can exhibit human-like characteristics that influence engagement, decision-making, and perceived realism. Personality, in particular, is critical, yet existing approaches often struggle to achieve both nuanced and adaptable expression. We present a framework that models LLM personality via Jungian psychological types, integrating three mechanisms: a dominant-auxiliary coordination mechanism for coherent core expression, a reinforcement-compensation mechanism for temporary adaptation to context, and a reflection mechanism that drives long-term personality evolution. This design allows the agent to maintain nuanced traits while dynamically adjusting to interaction demands and gradually updating its underlying structure. Personality alignment is evaluated using Myers-Briggs Type Indicator questionnaires and tested under diverse challenge scenarios as a preliminary structured assessment. Findings suggest that evolving, personality-aware LLMs can support coherent, context-sensitive interactions, enabling naturalistic agent design in HCI.
https://arxiv.org/abs/2601.10025
AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior to steal credentials or cause financial loss. The only known robust defense is architectural isolation that strictly separates trusted task planning from untrusted environment observations. However, applying this design to Computer Use Agents (CUAs) -- systems that automate tasks by viewing screens and executing actions -- presents a fundamental challenge: current agents require continuous observation of UI state to determine each action, conflicting with the isolation required for security. We resolve this tension by demonstrating that UI workflows, while dynamic, are structurally predictable. We introduce Single-Shot Planning for CUAs, where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content, providing provable control flow integrity guarantees against arbitrary instruction injections. Although this architectural isolation successfully prevents instruction injections, we show that additional measures are needed to prevent Branch Steering attacks, which manipulate UI elements to trigger unintended valid paths within the plan. We evaluate our design on OSWorld, and retain up to 57% of the performance of frontier models while improving performance for smaller open-source models by up to 19%, demonstrating that rigorous security and utility can coexist in CUAs.
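The control-flow-integrity guarantee above amounts to checking, at run time, that every executed action follows an edge of the pre-committed execution graph; injected instructions cannot introduce actions outside the plan. A minimal sketch (data structures here are assumptions, not the paper's representation):

```python
# Trusted plan as a graph of states with conditional action edges,
# fixed before any observation of untrusted content. Executed actions
# must follow plan edges; anything else is rejected.

def check_trace(plan, trace, start="start"):
    """plan: {state: {action: next_state}}; trace: list of actions.
    Returns (ok, final_state)."""
    state = start
    for action in trace:
        edges = plan.get(state, {})
        if action not in edges:   # action outside the plan: reject it
            return False, state
        state = edges[action]
    return True, state
```

Note that this check alone does not stop Branch Steering: a manipulated UI can still push execution down a *valid* but unintended edge, which is why the paper argues additional measures are needed.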
https://arxiv.org/abs/2601.09923
Retrieval-augmented generation (RAG) has become the default strategy for providing large language model (LLM) agents with contextual knowledge. Yet RAG treats memory as a stateless lookup table: information persists indefinitely, retrieval is read-only, and temporal continuity is absent. We define the \textit{Continuum Memory Architecture} (CMA), a class of systems that maintain and update internal state across interactions through persistent storage, selective retention, associative routing, temporal chaining, and consolidation into higher-order abstractions. Rather than disclosing implementation specifics, we specify the architectural requirements CMA imposes and show consistent behavioral advantages on tasks that expose RAG's structural inability to accumulate, mutate, or disambiguate memory. The empirical probes (knowledge updates, temporal association, associative recall, contextual disambiguation) demonstrate that CMA is a necessary architectural primitive for long-horizon agents while highlighting open challenges around latency, drift, and interpretability.
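The contrast with RAG's stateless lookup table can be made concrete with a toy mutable, time-aware store: later writes supersede earlier ones (knowledge updates), and entries carry timestamps that support temporal chaining. Class and method names are hypothetical, since the paper deliberately does not disclose implementation specifics:

```python
# Toy continuum-memory store: mutable (updates overwrite), time-aware
# (entries are timestamped), unlike an append-only RAG index.
import itertools

class ContinuumMemory:
    def __init__(self):
        self._clock = itertools.count()
        self.items = {}                  # key -> (value, timestamp)

    def write(self, key, value):
        """Later writes supersede earlier ones (knowledge updates)."""
        self.items[key] = (value, next(self._clock))

    def recall(self, key):
        entry = self.items.get(key)
        return entry[0] if entry else None

    def history(self, key_prefix):
        """Temporal chaining: entries under a prefix, oldest first."""
        hits = [(t, k, v) for k, (v, t) in self.items.items()
                if k.startswith(key_prefix)]
        return [v for _, _, v in sorted(hits)]
```

An append-only RAG index would retain both the stale and the fresh value and leave disambiguation to retrieval, which is exactly the failure mode the knowledge-update probes expose.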
https://arxiv.org/abs/2601.09913
Most existing Large Language Model (LLM)-based Multi-Agent Systems (MAS) rely on predefined workflows, where human engineers enumerate task states in advance and specify routing rules and contextual injections accordingly. Such workflow-driven designs are essentially rule-based decision trees, which suffer from two fundamental limitations: they require substantial manual effort to anticipate and encode possible task states, and they cannot exhaustively cover the state space of complex real-world tasks. To address these issues, we propose an Information-Flow-Orchestrated Multi-Agent Paradigm via Agent-to-Agent (A2A) Communication from CORAL, in which a dedicated information flow orchestrator continuously monitors task progress and dynamically coordinates other agents through the A2A toolkit using natural language, without relying on predefined workflows. We evaluate our approach on the general-purpose benchmark GAIA, using the representative workflow-based MAS OWL as the baseline while controlling for agent roles and underlying models. Under the pass@1 setting, our method achieves 63.64% accuracy, outperforming OWL's 55.15% by 8.49 percentage points with comparable token consumption. Further case-level analysis shows that our paradigm enables more flexible task monitoring and more robust handling of edge cases. Our implementation is publicly available at: this https URL
https://arxiv.org/abs/2601.09883
Anthropomorphisation -- the phenomenon whereby non-human entities are ascribed human-like qualities -- has become increasingly salient with the rise of large language model (LLM)-based conversational agents (CAs). Unlike earlier chatbots, LLM-based CAs routinely generate interactional and linguistic cues, such as first-person self-reference, epistemic and affective expressions that empirical work shows can increase engagement. On the other hand, anthropomorphisation raises ethical concerns, including deception, overreliance, and exploitative relationship framing, while some authors argue that anthropomorphic interaction may support autonomy, well-being, and inclusion. Despite increasing interest in the phenomenon, literature remains fragmented across domains and varies substantially in how it defines, operationalizes, and normatively evaluates anthropomorphisation. This scoping review maps ethically oriented work on anthropomorphising LLM-based CAs across five databases and three preprint repositories. We synthesize (1) conceptual foundations, (2) ethical challenges and opportunities, and (3) methodological approaches. We find convergence on attribution-based definitions but substantial divergence in operationalization, a predominantly risk-forward normative framing, and limited empirical work that links observed interaction effects to actionable governance guidance. We conclude with a research agenda and design/governance recommendations for ethically deploying anthropomorphic cues in LLM-based conversational agents.
https://arxiv.org/abs/2601.09869
As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
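The layer-selection statistics above can be sketched as follows: combine a Wanda-style weight-activation score with a gradient-importance score, z-score-normalize each so models of different scales are comparable, and hand the least-sensitive layers to the agent as pruning candidates. The 50/50 combination weight is an assumption:

```python
# Layer sensitivity sketch: z-score-normalized combination of a
# weight-activation metric and a gradient-importance metric.
# The equal weighting is an illustrative assumption.

def zscores(xs):
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
    return [(x - mean) / std for x in xs]

def least_sensitive_layers(wanda_scores, grad_scores, k):
    """Return indices of the k least-sensitive layers, i.e. the
    pruning candidates offered to the LLM agent this iteration."""
    combined = [0.5 * w + 0.5 * g
                for w, g in zip(zscores(wanda_scores), zscores(grad_scores))]
    order = sorted(range(len(combined)), key=lambda i: combined[i])
    return order[:k]
```

The checkpoint-rollback rule then sits on top: after pruning the chosen layers, revert to the last checkpoint whenever perplexity degradation exceeds the threshold.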
https://arxiv.org/abs/2601.09694
Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking component that autonomously extracts and verifies report statements via web search, even when citations are missing.
https://arxiv.org/abs/2601.09688
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce \textbf{Multi-Agent Test-Time Reinforcement Learning (MATTRL)}, a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a team of specialist experts for multi-turn discussion, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, which is then reinjected into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67\% over a multi-agent baseline, and by 8.67\% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective, and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
https://arxiv.org/abs/2601.09667
While GUI agents have shown strong performance under explicit, complete instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agents (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context, both to resolve omitted preferences in vague instructions and to anticipate latent routines from user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents' ability to resolve vague instructions and provide proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce the Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS; results show that HIM-Agent significantly improves execution and proactive performance by 15.7% and 7.3%, respectively.
https://arxiv.org/abs/2601.09636
Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant "Reasoning Gap": while native-like models (Claude 3.7) perform intuitively (40\% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54\%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4\% valid poems), while our hybrid verification loop restores performance to 73.1\%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
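The deterministic side of the hybrid verification loop can be illustrated with a heavily simplified pure-rhyme check: two words rhyme purely when their phoneme sequences are identical from the last stressed vowel onward. The phoneme encoding below (stress marked with a leading `'`) is an assumption; the system's actual Greek phonology and five-way taxonomy are far richer:

```python
# Toy pure-rhyme verifier over pre-phonemized words. Stress is marked
# by a leading "'" on a phoneme; the encoding is an illustrative
# assumption, not the paper's Greek phonological representation.

def rhyme_tail(phonemes):
    """Tail of the word starting at the last stress-marked phoneme."""
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i].startswith("'"):
            return [phonemes[i].lstrip("'")] + phonemes[i + 1:]
    return phonemes  # no stress mark: compare whole word

def is_pure_rhyme(a, b):
    # Identical words fall under the "Identical" category, not "Pure".
    return rhyme_tail(a) == rhyme_tail(b) and a != b
```

Checks like this are what let the agentic generation pipeline reject an LLM's invalid candidates deterministically, which is how the hybrid loop lifts valid-poem rates from under 4\% to 73.1\%.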
https://arxiv.org/abs/2601.09631
Modern LLM-based recommenders can generate compelling ranked lists, but they struggle to reliably satisfy governance constraints such as minimum long-tail exposure or diversity requirements. We present PCN-Rec, a proof-carrying negotiation pipeline that separates natural-language reasoning from deterministic enforcement. A base recommender (MF/CF) produces a candidate window of size W, which is negotiated over by two agents: a User Advocate optimizing relevance and a Policy Agent enforcing constraints. A mediator LLM synthesizes a top-N slate together with a structured certificate (JSON) describing the claimed constraint satisfaction. A deterministic verifier recomputes all constraints from the slate and accepts only verifier-checked certificates; if verification fails, a deterministic constrained-greedy repair produces a compliant slate for re-verification, yielding an auditable trace. On MovieLens-100K with governance constraints, PCN-Rec achieves a 98.55% pass rate on feasible users (n = 551, W = 80), versus a one-shot single-LLM baseline without verification/repair, while preserving utility with only a 0.021 absolute drop in NDCG@10 (0.403 vs. 0.424); the differences are statistically significant (p < 0.05).
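The verify-then-repair step above can be sketched for one representative governance constraint: at least `min_lt` long-tail items in the slate. The certificate fields and the specific repair heuristic below are illustrative assumptions:

```python
# Deterministic verifier and constrained-greedy repair for a single
# governance constraint (minimum long-tail exposure). Illustrative
# sketch; PCN-Rec's actual constraint set is richer.

def verify(slate, long_tail, min_lt):
    """Recompute the constraint directly from the slate."""
    return sum(1 for item in slate if item in long_tail) >= min_lt

def repair(slate, candidates, long_tail, min_lt):
    """Swap lowest-ranked head items for the best-ranked unused
    long-tail candidates until the constraint holds."""
    slate = list(slate)
    pool = [c for c in candidates if c in long_tail and c not in slate]
    for i in range(len(slate) - 1, -1, -1):
        if verify(slate, long_tail, min_lt) or not pool:
            break
        if slate[i] not in long_tail:
            slate[i] = pool.pop(0)   # candidates assumed ranked by score
    return slate
```

Because the verifier recomputes constraints from the slate itself, an LLM-written certificate can never be accepted on its own claims, which is what makes the trace auditable.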
https://arxiv.org/abs/2601.09771
Due to the dynamically evolving nature of real-world query streams, relevance models struggle to generalize to practical search scenarios. A promising solution is self-evolution. However, in large-scale industrial settings with massive query streams, this technique faces two challenges: (1) informative samples are often sparse and difficult to identify, and (2) pseudo-labels generated by the current model can be unreliable. To address these challenges, we propose the Self-Evolving Relevance Model (SERM), which comprises two complementary multi-agent modules: a multi-agent sample miner, designed to detect distributional shifts and identify informative training samples, and a multi-agent relevance annotator, which provides reliable labels through a two-level agreement framework. We evaluate SERM in a large-scale industrial setting that serves billions of user requests daily. Experimental results demonstrate that SERM achieves significant performance gains through iterative self-evolution, as validated by extensive offline multilingual evaluations and online testing.
https://arxiv.org/abs/2601.09515
Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which integrates both location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-based baselines. These results highlight that tool-aware active perception, enabled by staged policy reasoning and fine-grained reward feedback, is critical for building robust and data-efficient GUI agents.
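The spatially continuous reward described above can be sketched by mixing click-point proximity with box overlap (IoU), so the agent receives dense feedback even when the predicted region misses the target entirely. The exponential decay, its scale, and the 0.5/0.5 mix are assumptions, not the paper's exact shaping:

```python
# Dense grounding reward: proximity of box centers (exponentially
# decayed distance) blended with region overlap (IoU). The decay scale
# and mixing weights are illustrative assumptions.
import math

def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def spatial_reward(pred_box, gt_box, scale=100.0):
    cx = lambda r: ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)
    (px, py), (gx, gy) = cx(pred_box), cx(gt_box)
    proximity = math.exp(-math.hypot(px - gx, py - gy) / scale)
    return 0.5 * proximity + 0.5 * iou(pred_box, gt_box)
```

Unlike a binary hit/miss reward, this stays nonzero for near misses, which is what alleviates the reward sparsity common in GUI environments.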
https://arxiv.org/abs/2601.09770
Judicial reasoning in copyright damage awards poses a core challenge for computational legal analysis. Although federal courts follow the 1976 Copyright Act, their interpretations and factor weightings vary widely across jurisdictions. This inconsistency creates unpredictability for litigants and obscures the empirical basis of legal decisions. This research introduces a novel discourse-based Large Language Model (LLM) methodology that integrates Rhetorical Structure Theory (RST) with an agentic workflow to extract and quantify previously opaque reasoning patterns from judicial opinions. Our framework addresses a major gap in empirical legal scholarship by parsing opinions into hierarchical discourse structures via a three-stage pipeline: Dataset Construction, Discourse Analysis, and Agentic Feature Extraction. This pipeline identifies reasoning components and extracts feature labels together with their corresponding discourse subtrees. In analyzing copyright damage rulings, we show that discourse-augmented LLM analysis outperforms traditional methods while uncovering unquantified variations in factor weighting across circuits. These findings offer both methodological advances in computational legal analysis and practical insights into judicial reasoning, with implications for legal practitioners seeking predictive tools, scholars studying the application of legal principles, and policymakers confronting inconsistencies in copyright law.
https://arxiv.org/abs/2601.09459
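The data flow of the three-stage pipeline above (Dataset Construction, Discourse Analysis, Agentic Feature Extraction) can be sketched schematically. Everything here is a hypothetical illustration of the stages chaining together: the `DiscourseNode` class, the stubbed one-level parser, and the keyword-based factor labels stand in for the paper's actual RST parser and agentic extractor.

```python
from dataclasses import dataclass, field

@dataclass
class DiscourseNode:
    """A node in an RST-style discourse tree: a relation label plus children or a text span."""
    relation: str
    text: str = ""
    children: list = field(default_factory=list)

def construct_dataset(raw_opinions):
    """Stage 1: Dataset Construction -- keep opinions that reach a damages ruling (toy filter)."""
    return [op for op in raw_opinions if "damages" in op.lower()]

def discourse_analysis(opinion):
    """Stage 2: Discourse Analysis -- parse into a discourse tree (stubbed as one flat Elaboration)."""
    spans = [DiscourseNode("Span", text=s) for s in opinion.split(". ")]
    return DiscourseNode("Elaboration", children=spans)

def extract_features(tree):
    """Stage 3: Agentic Feature Extraction -- label subtrees carrying reasoning factors (toy keywords)."""
    keywords = {"willful": "willfulness", "profit": "infringer_profits", "deter": "deterrence"}
    labels = []
    for child in tree.children:
        for kw, label in keywords.items():
            if kw in child.text.lower():
                labels.append((label, child.text))
    return labels
```

The value of keeping the discourse subtree alongside each label, as the abstract describes, is that every extracted factor remains traceable to the exact span of reasoning that supports it.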
Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively "dressing" the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Project page: this https URL
https://arxiv.org/abs/2601.09452
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across seven benchmarks on the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by an average of 12.1% WER. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
https://arxiv.org/abs/2601.09413
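The self-reflection decision above, trusting the model's own hypothesis versus consulting external candidates, can be sketched as a simple gate. This is only an illustrative sketch of the idea, not the paper's learned primitive: the confidence threshold, the word-overlap agreement score, and the function name are all assumptions.

```python
def reflect_and_decide(own_hypothesis, external_candidates, own_conf, conf_threshold=0.85):
    """Toy reflection gate: keep the model's own answer when it is confident;
    otherwise consult external candidates, but only accept one that agrees
    closely with the model's own hypothesis (guarding against noisy inputs)."""
    if own_conf >= conf_threshold or not external_candidates:
        return own_hypothesis, "trust_self"

    def overlap(a, b):
        # Crude word-level agreement score (Jaccard over lowercase tokens).
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    best = max(external_candidates, key=lambda c: overlap(own_hypothesis, c))
    if overlap(own_hypothesis, best) >= 0.5:
        return best, "consult_external"
    # External hypotheses disagree too strongly: reject them rather than be derailed.
    return own_hypothesis, "reject_external"
```

The third branch is the key behavior the abstract emphasizes: an external candidate that diverges wildly from the model's own reading is treated as noise rather than blindly adopted.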
Current large language model agents predominantly operate under a reactive paradigm, responding only to immediate user queries within short-term sessions. This limitation hinders their ability to maintain users' long-term intents and to adapt dynamically to evolving external environments. In this paper, we propose a novel interaction paradigm for proactive task-oriented agents capable of bridging the gap between relatively static user needs and a dynamic environment. We formalize proactivity through two key capabilities: (i) Intent-Conditioned Monitoring: the agent autonomously formulates trigger conditions based on dialog history; (ii) Event-Triggered Follow-up: the agent actively engages the user upon detecting useful environmental updates. We introduce a high-quality data-synthesis pipeline to construct complex, multi-turn dialog data in a dynamic environment. Furthermore, we address the lack of evaluation criteria for task-oriented interaction in dynamic environments by proposing a new benchmark, ChronosBench. We evaluate several leading closed-source and open-source models on it and reveal their flaws in long-term task-oriented interaction. Finally, our model, fine-tuned on the synthetic data via supervised learning, achieves a task completion rate of 85.19% on complex tasks involving shifts in user intent, outperforming the other models under test and validating the effectiveness of our data-driven strategy.
https://arxiv.org/abs/2601.09382
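The two capabilities formalized above, Intent-Conditioned Monitoring and Event-Triggered Follow-up, can be sketched as a minimal monitor loop. The class and method names, the predicate-based trigger conditions, and the dictionary-shaped environment updates are hypothetical simplifications, not the paper's agent design.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TriggerCondition:
    """A condition the agent derives from dialog history, e.g. 'price of X drops below Y'."""
    key: str
    predicate: Callable

class ProactiveAgent:
    def __init__(self):
        self.conditions = []

    def formulate_condition(self, key, predicate):
        """Intent-Conditioned Monitoring: register a trigger distilled from dialog history."""
        self.conditions.append(TriggerCondition(key, predicate))

    def on_environment_update(self, update):
        """Event-Triggered Follow-up: produce user-facing messages when a condition fires."""
        messages = []
        for cond in self.conditions:
            if cond.key in update and cond.predicate(update[cond.key]):
                messages.append(f"Update on {cond.key}: {update[cond.key]}")
        return messages
```

For example, after a user asks to be told when a flight gets cheap, the agent might register `formulate_condition("flight_price", lambda p: p < 300)`; later environment updates are then checked against that standing intent instead of waiting for the user to ask again.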