Personalizing Large Language Models typically relies on static retrieval or one-time adaptation, assuming user preferences remain invariant over time. However, real-world interactions are dynamic, where user interests continuously evolve, posing a challenge for models to adapt to preference drift without catastrophic forgetting. Standard continual learning approaches often struggle in this context, as they indiscriminately update on noisy interaction streams, failing to distinguish genuine preference shifts from transient contexts. To address this, we introduce SPRInG, a novel semi-parametric framework designed for effective continual personalization. During training, SPRInG employs drift-driven selective adaptation, which utilizes a likelihood-based scoring function to identify high-novelty interactions. This allows the model to selectively update the user-specific adapter on drift signals while preserving hard-to-learn residuals in a replay buffer. During inference, we apply strict relevance gating and fuse parametric knowledge with retrieved history via logit interpolation. Experiments on a long-form personalized generation benchmark demonstrate that SPRInG outperforms existing baselines, validating its robustness for real-world continual personalization.
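The two training-time and inference-time mechanisms above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scoring function, the novelty threshold, and the interpolation weight `alpha` are all assumptions, and real logits would come from the adapter and a retrieval-conditioned pass.

```python
def novelty_score(token_logprobs):
    """Likelihood-based novelty: mean negative log-likelihood of an
    interaction under the current user adapter (higher = more novel)."""
    return -sum(token_logprobs) / len(token_logprobs)

def route_interaction(token_logprobs, threshold):
    """Drift-driven selective adaptation: update the user-specific adapter
    on high-novelty (drift) signals, keep the rest in a replay buffer."""
    return "update" if novelty_score(token_logprobs) > threshold else "buffer"

def fuse_logits(parametric, retrieved, alpha=0.5):
    """Logit interpolation at inference: per-token convex combination of
    the adapter's logits and logits conditioned on retrieved history."""
    return [alpha * p + (1.0 - alpha) * r for p, r in zip(parametric, retrieved)]
```

In this sketch, a well-modeled interaction (small negative log-likelihoods) is buffered, while a surprising one triggers an adapter update.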
https://arxiv.org/abs/2601.09974
Recent advances in large language models (LLMs) have led to substantial progress in domain-specific applications, particularly within the legal domain. However, general-purpose models such as GPT-4 often struggle with specialized subdomains that require precise legal knowledge, complex reasoning, and contextual sensitivity. To address these limitations, we present LabourLawLLM, a legal large language model tailored to Chinese labor law. We also introduce LabourLawBench, a comprehensive benchmark covering diverse labor-law tasks, including legal provision citation, knowledge-based question answering, case classification, compensation computation, named entity recognition, and legal case analysis. Our evaluation framework combines objective metrics (e.g., ROUGE-L, accuracy, F1, and soft-F1) with subjective assessment based on GPT-4 scoring. Experiments show that LabourLawLLM consistently outperforms general-purpose and existing legal-specific LLMs across task categories. Beyond labor law, our methodology provides a scalable approach for building specialized LLMs in other legal subfields, improving accuracy, reliability, and societal value of legal AI applications.
https://arxiv.org/abs/2601.09972
Large Language Models (LLMs) and Large Reasoning Models (LRMs) offer transformative potential for high-stakes domains like finance and law, but their tendency to hallucinate, generating factually incorrect or unsupported content, poses a critical reliability risk. This paper introduces a comprehensive operational framework for hallucination management, built on a continuous improvement cycle driven by root cause awareness. We categorize hallucination sources into model, data, and context-related factors, allowing targeted interventions over generic fixes. The framework integrates multi-faceted detection methods (e.g., uncertainty estimation, reasoning consistency) with stratified mitigation strategies (e.g., knowledge grounding, confidence calibration). We demonstrate its application through a tiered architecture and a financial data extraction case study, where model, context, and data tiers form a closed feedback loop for progressive reliability enhancement. This approach provides a systematic, scalable methodology for building trustworthy generative AI systems in regulated environments.
https://arxiv.org/abs/2601.09929
AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior to steal credentials or cause financial loss. The only known robust defense is architectural isolation that strictly separates trusted task planning from untrusted environment observations. However, applying this design to Computer Use Agents (CUAs) -- systems that automate tasks by viewing screens and executing actions -- presents a fundamental challenge: current agents require continuous observation of UI state to determine each action, conflicting with the isolation required for security. We resolve this tension by demonstrating that UI workflows, while dynamic, are structurally predictable. We introduce Single-Shot Planning for CUAs, where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content, providing provable control flow integrity guarantees against arbitrary instruction injections. Although this architectural isolation successfully prevents instruction injections, we show that additional measures are needed to prevent Branch Steering attacks, which manipulate UI elements to trigger unintended valid paths within the plan. We evaluate our design on OSWorld, and retain up to 57% of the performance of frontier models while improving performance for smaller open-source models by up to 19%, demonstrating that rigorous security and utility can coexist in CUAs.
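The control-flow-integrity idea above can be sketched as a pre-committed plan graph: the trusted planner fixes all nodes and conditional branches before any untrusted observation, and at run time an observation may only select among the pre-declared branches, never introduce a new action. The node names, branch labels, and class shape here are illustrative assumptions, not the paper's API.

```python
class PlanGraph:
    """Single-shot plan: committed by the trusted planner before any
    observation of potentially malicious content. Injected instructions
    cannot add actions; at best they pick an existing branch (the Branch
    Steering surface the paper discusses separately)."""
    def __init__(self, nodes, edges, start):
        self.nodes = nodes      # node id -> action description
        self.edges = edges      # node id -> {branch_label: next node id}
        self.current = start

    def step(self, observed_branch):
        """Advance along a pre-declared branch; reject anything else."""
        allowed = self.edges.get(self.current, {})
        if observed_branch not in allowed:
            raise ValueError("observation outside committed plan")
        self.current = allowed[observed_branch]
        return self.nodes[self.current]
```

An injected "instruction" that is not a declared branch label is rejected outright, which is the integrity guarantee in miniature.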
https://arxiv.org/abs/2601.09923
Retrieval-augmented generation (RAG) has become the default strategy for providing large language model (LLM) agents with contextual knowledge. Yet RAG treats memory as a stateless lookup table: information persists indefinitely, retrieval is read-only, and temporal continuity is absent. We define the Continuum Memory Architecture (CMA), a class of systems that maintain and update internal state across interactions through persistent storage, selective retention, associative routing, temporal chaining, and consolidation into higher-order abstractions. Rather than disclosing implementation specifics, we specify the architectural requirements CMA imposes and show consistent behavioral advantages on tasks that expose RAG's structural inability to accumulate, mutate, or disambiguate memory. The empirical probes (knowledge updates, temporal association, associative recall, contextual disambiguation) demonstrate that CMA is a necessary architectural primitive for long-horizon agents while highlighting open challenges around latency, drift, and interpretability.
https://arxiv.org/abs/2601.09913
Most existing Large Language Model (LLM)-based Multi-Agent Systems (MAS) rely on predefined workflows, where human engineers enumerate task states in advance and specify routing rules and contextual injections accordingly. Such workflow-driven designs are essentially rule-based decision trees, which suffer from two fundamental limitations: they require substantial manual effort to anticipate and encode possible task states, and they cannot exhaustively cover the state space of complex real-world tasks. To address these issues, we propose an Information-Flow-Orchestrated Multi-Agent Paradigm via Agent-to-Agent (A2A) Communication from CORAL, in which a dedicated information flow orchestrator continuously monitors task progress and dynamically coordinates other agents through the A2A toolkit using natural language, without relying on predefined workflows. We evaluate our approach on the general-purpose benchmark GAIA, using the representative workflow-based MAS OWL as the baseline while controlling for agent roles and underlying models. Under the pass@1 setting, our method achieves 63.64% accuracy, outperforming OWL's 55.15% by 8.49 percentage points with comparable token consumption. Further case-level analysis shows that our paradigm enables more flexible task monitoring and more robust handling of edge cases. Our implementation is publicly available at: this https URL
https://arxiv.org/abs/2601.09883
Human-AI complementarity is the claim that a human supported by an AI system can outperform either alone in a decision-making process. Since its introduction in the human-AI interaction literature, it has gained traction by generalizing the reliance paradigm and by offering a more practical alternative to the contested construct of 'trust in AI.' Yet complementarity faces key theoretical challenges: it lacks precise theoretical anchoring; it is formalized merely as a post hoc indicator of relative predictive accuracy; it remains silent about other desiderata of human-AI interaction; and it abstracts away from the magnitude-cost profile of its performance gain. As a result, complementarity is difficult to obtain in empirical settings. In this work, we leverage epistemology to address these challenges by reframing complementarity within the discourse on justificatory AI. Drawing on computational reliabilism, we argue that historical instances of complementarity function as evidence that a given human-AI interaction is a reliable epistemic process for a given predictive task. Together with other reliability indicators assessing the alignment of the human-AI team with the epistemic standards and socio-technical practices, complementarity contributes to the degree of reliability of human-AI teams when generating predictions. This supports the practical reasoning of those affected by these outputs -- patients, managers, regulators, and others. In summary, our approach suggests that the role and value of complementarity lies not in providing a relative measure of predictive accuracy, but in helping calibrate decision-making to the reliability of AI-supported processes that increasingly shape everyday life.
https://arxiv.org/abs/2601.09871
Anthropomorphisation -- the phenomenon whereby non-human entities are ascribed human-like qualities -- has become increasingly salient with the rise of large language model (LLM)-based conversational agents (CAs). Unlike earlier chatbots, LLM-based CAs routinely generate interactional and linguistic cues, such as first-person self-reference and epistemic and affective expressions, that empirical work shows can increase engagement. On the other hand, anthropomorphisation raises ethical concerns, including deception, overreliance, and exploitative relationship framing, while some authors argue that anthropomorphic interaction may support autonomy, well-being, and inclusion. Despite increasing interest in the phenomenon, the literature remains fragmented across domains and varies substantially in how it defines, operationalizes, and normatively evaluates anthropomorphisation. This scoping review maps ethically oriented work on anthropomorphising LLM-based CAs across five databases and three preprint repositories. We synthesize (1) conceptual foundations, (2) ethical challenges and opportunities, and (3) methodological approaches. We find convergence on attribution-based definitions but substantial divergence in operationalization, a predominantly risk-forward normative framing, and limited empirical work that links observed interaction effects to actionable governance guidance. We conclude with a research agenda and design/governance recommendations for ethically deploying anthropomorphic cues in LLM-based conversational agents.
https://arxiv.org/abs/2601.09869
Sequential test-time scaling is a promising training-free method for improving large reasoning model accuracy, but current implementations exhibit significant limitations. Inducing models to think for longer can increase their accuracy, but extending the reasoning length further has also been shown to degrade accuracy and destabilize the model. This work presents a novel sequential test-time scaling method, Min-Seek, which significantly improves model accuracy over a wide range of induced thoughts, stabilizes the accuracy of sequential scaling, and removes the need for reasoning-length fine-tuning. Beyond improving model accuracy across a variety of reasoning tasks, our method is inherently efficient, as only the KV pairs of one additional induced thought are kept in the KV cache during reasoning. With a custom KV cache that stores keys without position embeddings, dynamically encoding them contiguously before each newly generated thought, our method can continue to reason well beyond a model's maximum context length and, under mild conditions, has linear computational complexity.
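The position-free KV cache described above can be sketched as follows. This is a toy illustration under strong assumptions: keys are 2-vectors, the "rotary" encoding is a single 2-D rotation with an assumed base angle, and the class shape is invented for exposition; the point is only that keys stored without positions can be re-encoded with contiguous indices after evicting a finished thought, so no position gaps accumulate.

```python
import math

def rotate(key, theta):
    """Toy 2-D rotary position encoding applied to a 2-vector key."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * key[0] - s * key[1], s * key[0] + c * key[1])

class ContiguousKVCache:
    """Stores raw (position-free) keys; positions are assigned contiguously
    at read time, so evicting old thoughts never leaves position gaps and
    the cache can outlive the model's nominal context length."""
    def __init__(self, base=0.1):
        self.base = base
        self.keys = []          # keys stored WITHOUT position encoding

    def append(self, key):
        self.keys.append(key)

    def evict_thought(self, start, end):
        """Drop a finished induced thought; remaining keys stay valid."""
        del self.keys[start:end]

    def positioned_keys(self):
        """Encode positions 0..n-1 contiguously before the next generation."""
        return [rotate(k, self.base * i) for i, k in enumerate(self.keys)]
```

After eviction, the surviving keys simply slide down to positions 0..n-1, which is what lets reasoning continue past the maximum context length.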
https://arxiv.org/abs/2601.09855
Modern logical reasoning with LLMs primarily relies on employing complex interactive frameworks that decompose the reasoning process into subtasks solved through carefully designed prompts or requiring external resources (e.g., symbolic solvers) to exploit their strong logical structures. While interactive approaches introduce additional overhead, hybrid approaches depend on external components, limiting their scalability. A non-interactive, end-to-end framework enables reasoning to emerge within the model itself -- improving generalization while preserving analyzability without any external resources. In this work, we introduce a non-interactive, end-to-end framework for reasoning tasks. We show that introducing structural information into the few-shot prompt activates a subset of attention heads whose patterns align with logical reasoning operators. Building on this insight, we propose Attention-Aware Intervention (AAI), an inference-time intervention method that reweights attention scores across selected heads identified by their logical patterns. AAI offers an efficient way to steer the model's reasoning toward leveraging prior knowledge through attention modulation. Extensive experiments show that AAI enhances logical reasoning performance across diverse benchmarks and model architectures, while incurring negligible additional computational overhead. Code is available at this https URL.
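An attention-reweighting intervention of the kind AAI describes can be sketched in a few lines. The gain values, the per-head data layout, and the function names are assumptions for illustration; the paper's actual method selects heads by their logical patterns, which is abstracted here into a `head_weights` map.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def reweight_attention(scores_per_head, head_weights):
    """Inference-time intervention: scale the pre-softmax attention scores
    of selected heads, then renormalize. head_weights maps head index to
    a multiplicative gain; heads not listed are left unchanged."""
    out = []
    for h, scores in enumerate(scores_per_head):
        gain = head_weights.get(h, 1.0)
        out.append(softmax([gain * s for s in scores]))
    return out
```

A gain above 1 sharpens the selected head's distribution toward its highest-scoring positions, which is one way to steer the model without touching its weights.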
https://arxiv.org/abs/2601.09805
Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective that aligns manipulation trajectories and transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
https://arxiv.org/abs/2601.09708
Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
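The value-aware prefix token can be illustrated with a fixed feature encoding. Note the hedge: the paper conditions a *learned* embedding on the numerical value, whereas this sketch uses a hand-built sign/log-magnitude/sinusoidal vector purely to show what "injecting magnitude into the input space" can look like; the dimension and frequencies are assumptions.

```python
import math

def value_prefix_embedding(value, dim=8):
    """Hypothetical value-aware prefix: encode sign and log-magnitude of a
    number as a fixed-size vector to be prepended to its token embeddings.
    (Illustrative stand-in for the paper's learned, value-conditioned
    embedding.)"""
    sign = 0.0 if value == 0 else math.copysign(1.0, value)
    mag = math.log1p(abs(value))
    emb = [sign, mag]
    for i in range((dim - 2) // 2):
        freq = 1.0 / (10.0 ** i)
        emb += [math.sin(mag * freq), math.cos(mag * freq)]
    return emb

def augment_tokens(token_embs, value, dim=8):
    """Prepend the value-aware prefix to a number's token embeddings,
    leaving the original tokenization untouched."""
    return [value_prefix_embedding(value, dim)] + token_embs
```

Because the prefix is a separate token, this stays compatible with existing tokenizers and decoder-only architectures, as the abstract notes.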
https://arxiv.org/abs/2601.09706
Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development efforts and enhancing software productivity. The emergence of large language models (LLMs) has significantly advanced code generation, though their efficiency is still impacted by certain inherent architectural constraints. Each token generation necessitates a complete inference pass, requiring persistent retention of contextual information in memory and escalating resource consumption. While existing research prioritizes inference-phase optimizations such as prompt compression and model quantization, the generation phase remains underexplored. To tackle these challenges, we propose a knowledge-infused framework named ShortCoder, which optimizes code generation efficiency while preserving semantic equivalence and readability. In particular, we introduce: (1) ten syntax-level simplification rules for Python, derived from AST-preserving transformations, achieving 18.1% token reduction without functional compromise; (2) a hybrid data synthesis pipeline integrating rule-based rewriting with LLM-guided refinement, producing ShorterCodeBench, a corpus of validated tuples of original code and simplified code with semantic consistency; (3) a fine-tuning strategy that injects conciseness awareness into the base LLMs. Extensive experimental results demonstrate that ShortCoder consistently outperforms state-of-the-art methods on HumanEval, achieving an improvement of 18.1%-37.8% in generation efficiency over previous methods while ensuring the performance of code generation.
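One concrete instance of a syntax-level, AST-preserving simplification rule (the paper's ten rules are not enumerated in the abstract, so this particular rule is an assumed example in the same spirit): rewriting an `if cond: return True / else: return False` pattern into the shorter, semantically equivalent `return cond`.

```python
import ast

class ReturnBoolSimplifier(ast.NodeTransformer):
    """Illustrative token-reduction rule: collapse a boolean-returning
    if/else into a direct return of the condition."""
    def visit_If(self, node):
        self.generic_visit(node)
        if (len(node.body) == 1 and len(node.orelse) == 1
                and isinstance(node.body[0], ast.Return)
                and isinstance(node.orelse[0], ast.Return)
                and isinstance(node.body[0].value, ast.Constant)
                and isinstance(node.orelse[0].value, ast.Constant)
                and node.body[0].value.value is True
                and node.orelse[0].value.value is False):
            return ast.Return(value=node.test)
        return node

def simplify(source):
    """Apply the rule and unparse back to shorter source code."""
    tree = ReturnBoolSimplifier().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))
```

Because the transformation operates on the AST and preserves semantics, the shortened code is functionally identical while costing fewer generated tokens.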
https://arxiv.org/abs/2601.09703
Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its original implementation, its group-level collective memory selection is suboptimal for complex multi-object scenarios, as it employs a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3-DMS, a training-free decoupled strategy that utilizes fine-grained memory selection on individual objects. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced with increased target density, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.
https://arxiv.org/abs/2601.09699
3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
https://arxiv.org/abs/2601.09698
Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
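The adaptive keyframe budget can be sketched with a crude heuristic. SRENDER *learns* a predictor of the optimal keyframe count; the complexity measure, the clamp range, and the scale factor below are all assumptions, shown only to make the sparse-for-simple / dense-for-complex behavior concrete.

```python
import math

def trajectory_complexity(positions, yaws):
    """Crude camera-trajectory complexity: total translation plus total
    yaw rotation along the path (illustrative stand-in for a learned
    predictor)."""
    trans = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    rot = sum(abs(b - a) for a, b in zip(yaws, yaws[1:]))
    return trans + rot

def num_keyframes(positions, yaws, k_min=2, k_max=12, scale=2.0):
    """Map complexity to an adaptive keyframe budget within [k_min, k_max]."""
    c = trajectory_complexity(positions, yaws)
    return max(k_min, min(k_max, k_min + int(c * scale)))
```

Intermediate frames between the budgeted keyframes would then come from 3D reconstruction and rendering rather than diffusion, which is where the amortized speedup comes from.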
https://arxiv.org/abs/2601.09697
LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.
https://arxiv.org/abs/2601.09696
As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
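The layer-wise sensitivity profile that the pruning agent consumes can be sketched as follows. The mixing weight `beta` is an assumption (the abstract does not state how the two signals are combined), and the greedy selector is only a stand-in for the LLM agent's reasoning over the profile.

```python
def zscores(values):
    """Normalize raw per-layer scores to z-scores for model-agnostic
    comparison, as the paper describes."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5 or 1.0   # guard against a constant profile
    return [(v - mean) / std for v in values]

def sensitivity_profile(wanda_scores, grad_scores, beta=0.5):
    """Combine a Wanda-style weight-activation metric with gradient
    importance into one per-layer sensitivity score (beta assumed)."""
    wz, gz = zscores(wanda_scores), zscores(grad_scores)
    return [beta * w + (1 - beta) * g for w, g in zip(wz, gz)]

def pick_layers_to_prune(profile, k):
    """Greedy stand-in for the agent: prune the k least-sensitive layers.
    The real system also self-reflects on outcomes and can roll back."""
    return sorted(range(len(profile)), key=lambda i: profile[i])[:k]
```

The checkpoint-rollback mechanism would wrap calls like these, reverting whenever perplexity degradation after a pruning step exceeds a threshold.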
https://arxiv.org/abs/2601.09694
Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
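CASCAL's consensus-voting step can be illustrated compactly: with no ground-truth labels, the majority answer across the model pool serves as a pseudo-label, and each model is scored against it. This sketch covers only the voting component; the hierarchical clustering that identifies skill niches is omitted, and the data layout is an assumption.

```python
from collections import Counter

def consensus_accuracy(answers_per_model):
    """Label-free correctness estimate: take the majority answer per query
    as a pseudo-label, then score every model against it.
    answers_per_model[m][q] is model m's answer to query q."""
    n_models = len(answers_per_model)
    n_queries = len(answers_per_model[0])
    correct = [0] * n_models
    for q in range(n_queries):
        votes = Counter(answers_per_model[m][q] for m in range(n_models))
        pseudo_label = votes.most_common(1)[0][0]
        for m in range(n_models):
            if answers_per_model[m][q] == pseudo_label:
                correct[m] += 1
    return [c / n_queries for c in correct]
```

Because only queries (and the pool's own answers) are needed, this estimate survives a weak generator whose *answers* are unreliable, which is consistent with the robustness result reported above.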
https://arxiv.org/abs/2601.09692
Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter Task Qualification and Search Necessity to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.
https://arxiv.org/abs/2601.09688