Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manual prompt construction, many prompt optimization methods have been proposed in the literature. Method development has been mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper, we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can then be studied formally as conditioning these Bayesian predictors, yielding criteria for the target tasks on which optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, i.e., sequences of real-valued vectors outside the token alphabet, can produce very effective prompts for trained and even untrained networks by manipulating activations in ways that hard tokens cannot. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
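A minimal sketch of the soft-prefix mechanism the abstract describes: a sequence of free real-valued vectors is prepended to the token embeddings and optimized by gradient descent while the network stays frozen. The model interface and hyperparameters below are assumptions for illustration, not the paper's code.

```python
import torch

def tune_soft_prefix(model, embed, inputs, targets, prefix_len=8, steps=200, lr=1e-2):
    # model: any callable mapping embeddings (batch, seq, d) -> logits (assumed interface).
    # The prefix lives outside the token alphabet: it is never decoded to hard tokens.
    d = embed.embedding_dim
    prefix = torch.randn(prefix_len, d, requires_grad=True)
    opt = torch.optim.Adam([prefix], lr=lr)
    for _ in range(steps):
        # Prepend the soft prefix to the embedded input sequence.
        x = torch.cat([prefix.expand(inputs.size(0), -1, -1), embed(inputs)], dim=1)
        logits = model(x)  # all model weights stay frozen
        loss = torch.nn.functional.cross_entropy(
            logits[:, prefix_len:].reshape(-1, logits.size(-1)), targets.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return prefix.detach()
```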
https://arxiv.org/abs/2505.17010
Large Language Models (LLMs) are powerful but prone to hallucinations due to their static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT cold-start phase for preliminary format learning, followed by RL for dynamic knowledge acquisition. The RL stage uses outcome supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at this https URL.
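A schematic of the outcome-supervised reward with an internal-knowledge bonus that the abstract sketches. The weights and helper predicates are assumptions for illustration, not the paper's exact scheme.

```python
def fused_reward(answer, gold, used_retrieval, format_ok,
                 w_outcome=1.0, w_internal=0.2, w_format=0.1):
    """Sketch of an outcome-supervised reward (illustrative weights).

    - outcome: 1 if the final answer matches the reference, else 0
    - internal-knowledge bonus: extra reward for solving *without* external
      search, nudging the model to prefer its own knowledge when sufficient
    - format: small bonus for well-formed reasoning/search markup
    """
    outcome = 1.0 if answer == gold else 0.0
    internal_bonus = w_internal if (outcome == 1.0 and not used_retrieval) else 0.0
    return w_outcome * outcome + internal_bonus + (w_format if format_ok else 0.0)
```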
https://arxiv.org/abs/2505.17005
We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model with plug-and-play guidance for conditioning. Our method first trains an unconditional, discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% of the solution observed, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at this https URL
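One guidance step of diffusion posterior sampling in the style the abstract describes, as a sketch: Tweedie denoising followed by a gradient step on the data-fit loss. The score model, observation operator, and noise schedule are assumed interfaces, not the FunDPS release.

```python
import torch

def sigma(t):
    # Placeholder variance-exploding noise schedule; the real schedule is model-specific.
    return t

def guided_step(x_t, t, score_model, observe, y_obs, step_size):
    """One plug-and-play guidance step (sketch).

    1) Tweedie denoising: E[x0 | x_t] = x_t + sigma(t)^2 * score(x_t, t)
    2) Measurement guidance: step down the gradient of the data-fit loss so
       the sample moves toward consistency with the sparse observations y_obs.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = x_t + sigma(t) ** 2 * score_model(x_t, t)  # Tweedie's formula
    loss = ((observe(x0_hat) - y_obs) ** 2).sum()       # sparse-observation misfit
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - step_size * grad).detach()
```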
https://arxiv.org/abs/2505.17004
Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal languages to guide LLMs in deriving reliable reasoning paths, and systematic evaluations of these capabilities remain limited. In this paper, we conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. Along three dimensions, i.e., the spectrum of LLMs, the taxonomy of tasks, and the format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data in PoT format achieves the best generalization performance across other languages. Additionally, we curate formal-language-related training data to further enhance small language models, and the experimental results indicate that a simple rejection fine-tuning method better enables LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at this https URL.
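A sketch of the rejection fine-tuning data curation the abstract mentions, under our reading: sample several candidate formal-language solutions, keep only those a verifier accepts, and fine-tune on the survivors. The sampler and verifier interfaces are assumptions.

```python
def rejection_finetune_data(model, problems, verifier, k=8):
    """Curate rejection-sampled fine-tuning pairs (sketch).

    For each problem, draw k candidate formal-language trajectories and keep
    the first one the verifier (e.g., a program executor or proof checker)
    accepts; rejected samples are discarded.
    """
    kept = []
    for p in problems:
        for cand in model.sample(p, n=k, temperature=0.8):  # assumed sampler API
            if verifier(p, cand):  # executes / checks the candidate trajectory
                kept.append((p, cand))
                break              # one verified solution per problem suffices
    return kept
```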
https://arxiv.org/abs/2505.16998
LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to an 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS achieves a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
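The benchmarking stage effectively produces a routing table from (domain, function) pairs to the best-performing LLM. A minimal sketch of that idea; the model names and pairs below are placeholders, not the paper's measured picks.

```python
# Illustrative routing table distilled from per-(domain, function) benchmarking.
BEST_MODEL = {
    ("math", "planning"):   "llm-a",
    ("math", "reasoning"):  "llm-b",
    ("coding", "revising"): "llm-c",
}

def pick_model(domain: str, function: str, default: str = "llm-a") -> str:
    # Fall back to a sensible default for unmeasured combinations.
    return BEST_MODEL.get((domain, function), default)
```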
https://arxiv.org/abs/2505.16997
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules that yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs are limited by significant resource cost and suboptimal joint optimization. To address these issues, we propose \name, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of \name simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to stimulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of \name, showing relative improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Code available at this https URL.
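A sketch of a label-only fused reward in the spirit of RecPO, under our reading of the abstract: the interleaved reasoning trace is never scored directly, only through the quality of the recommendation it precedes. The reciprocal-rank shaping and weights are assumptions.

```python
def label_only_reward(ranked_items, target_item, format_ok, k=5, w_format=0.1):
    """Reward from recommendation labels alone (sketch).

    No reasoning annotations are needed: a correct, well-ranked recommendation
    implicitly rewards the reasoning that produced it.
    """
    if target_item in ranked_items[:k]:
        rank_part = 1.0 / (ranked_items.index(target_item) + 1)  # reciprocal rank
    else:
        rank_part = 0.0
    return rank_part + (w_format if format_ok else 0.0)
```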
https://arxiv.org/abs/2505.16994
LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invites contributions from the broader open-source community.
https://arxiv.org/abs/2505.16988
Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single-domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting its ability to plan and reason in complex, tool-dependent scenarios.
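A minimal sketch of the recompute-or-reuse decision behind such a caching mechanism. The interface, hashing scheme, and TTL are assumptions for illustration, not the T1 release.

```python
import hashlib
import time

class ToolCache:
    """Cache tool-call results and decide whether to reuse or recompute (sketch)."""

    def __init__(self, ttl_seconds=300):
        self.store = {}           # key -> (timestamp, result)
        self.ttl = ttl_seconds    # short-term memory horizon (assumed)

    def key(self, tool, args):
        # Stable key over the tool name and its arguments.
        return hashlib.sha256(f"{tool}:{sorted(args.items())}".encode()).hexdigest()

    def call(self, tool, fn, args, force_recompute=False):
        k = self.key(tool, args)
        hit = self.store.get(k)
        if hit and not force_recompute and time.time() - hit[0] < self.ttl:
            return hit[1]                       # reuse the cached result
        result = fn(**args)                     # recompute via the actual tool
        self.store[k] = (time.time(), result)   # assimilate into memory
        return result
```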
https://arxiv.org/abs/2505.16986
Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge graphs (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, and enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.
https://arxiv.org/abs/2505.16982
Single-agent LLMs hit hard limits: finite context, role overload, and brittle domain transfer. Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tune, self-check). Grounded in the No-Free-Lunch theorem, KtR trades the chase for a universal prompt for disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators, with no ever-larger monoliths required.
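A sketch of the controller-mediated blueprint idea: typed subtasks run in sequence, each gated by a lightweight self-check, with a single chain-of-thought retry as the "lightest viable boost". The roles and retry policy are illustrative assumptions, not the paper's exact blueprint.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Subtask:
    name: str                     # typed role, e.g. "enumerate", "evaluate", "select"
    prompt: str                   # template with a {state} slot
    check: Callable[[str], bool]  # lightweight self-check gate

def run_blueprint(llm: Callable[[str], str], subtasks: List[Subtask], state: str) -> str:
    # Controller-mediated pipeline: each typed subtask consumes the running
    # state and must pass its self-check before the controller moves on.
    for st in subtasks:
        out = llm(st.prompt.format(state=state))
        if not st.check(out):     # lightest viable boost: one chain-of-thought retry
            out = llm(st.prompt.format(state=state) + "\nThink step by step.")
        state = out
    return state
```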
https://arxiv.org/abs/2505.16979
Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from sets of a small number of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduced a novel dataset comprising 540 structured grammar generation challenges, devised 6 metrics, and evaluated 8 different LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose HyGenar, an LLM-driven hybrid genetic algorithm, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of the generated grammars across LLMs.
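A sketch of an LLM-driven hybrid genetic loop of the kind the abstract describes, under our reading: the LLM proposes and recombines candidate BNF grammars, and fitness scores how well a grammar accepts the positive examples and rejects the negative ones. The `llm_propose` interface and operator details are assumptions.

```python
import random

def genetic_grammar_search(llm_propose, fitness, pop_size=20, generations=10):
    """LLM-driven hybrid GA for BNF grammar induction (sketch).

    fitness(grammar) -> float in [0, 1]: fraction of positive examples parsed
    plus fraction of negative examples rejected, suitably normalized.
    """
    population = [llm_propose() for _ in range(pop_size)]  # LLM seeds candidates
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]                  # selection
        children = [
            llm_propose(crossover=random.sample(parents, 2))  # LLM-guided
            for _ in range(pop_size - len(parents))           # crossover/mutation
        ]
        population = parents + children
    return max(population, key=fitness)
```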
https://arxiv.org/abs/2505.16978
We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. The dataset and benchmark are on HuggingFace (this https URL), with code on GitHub (this https URL).
https://arxiv.org/abs/2505.16968
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35x and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find that GPT-4o's judgments show much higher agreement with humans than GPT-4o-mini's.
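A sketch of a cascading-judge relabeling pass, under our reading of the abstract: a cheap judge screens every mined negative and only borderline cases escalate to a stronger judge. The judge interfaces and verdict labels are assumptions.

```python
def relabel_hard_negatives(query, negatives, cheap_judge, strong_judge):
    """Cascade of LLM prompts to surface false negatives (sketch).

    cheap_judge / strong_judge: callables returning one of
    "relevant", "irrelevant", or "unsure" (assumed interface).
    """
    promoted = []
    for passage in negatives:
        verdict = cheap_judge(query, passage)       # cheap first pass
        if verdict == "unsure":
            verdict = strong_judge(query, passage)  # escalate only when needed
        if verdict == "relevant":
            promoted.append(passage)                # false negative -> true positive
    return promoted
```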
https://arxiv.org/abs/2505.16967
Text segmentation based on the semantic meaning of sentences is a fundamental task with broad utility in many downstream applications. In this paper, we propose a graphical-model-based unsupervised learning approach, named BP-Seg, for efficient text segmentation. Our method not only considers local coherence, capturing the intuition that adjacent sentences are often more related, but also effectively groups sentences that are distant in the text yet semantically similar. This is achieved through belief propagation on a carefully constructed graphical model. Experimental results on both an illustrative example and a dataset with long-form documents demonstrate that our method performs favorably compared with competing approaches.
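A simplified message-passing sketch in the spirit of BP-Seg, not the paper's exact model: sentences are nodes, adjacency edges encode local coherence, and similarity-weighted long-range edges let distant but semantically similar sentences share a segment label.

```python
import numpy as np

def bp_seg_sketch(emb, n_segments=3, rounds=10, local_w=1.0, seed=0):
    # emb: (n_sentences, dim) sentence embeddings (assumed given).
    n = len(emb)
    norms = np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9
    sim = np.maximum((emb / norms) @ (emb / norms).T, 0.0)  # cosine, clipped at 0
    # Soft per-sentence segment beliefs, randomly initialized.
    belief = np.random.default_rng(seed).dirichlet(np.ones(n_segments), size=n)
    for _ in range(rounds):
        msg = sim @ belief                 # long-range semantic neighbors vote
        msg[1:] += local_w * belief[:-1]   # left-adjacent coherence
        msg[:-1] += local_w * belief[1:]   # right-adjacent coherence
        belief = msg / msg.sum(axis=1, keepdims=True)
    return belief.argmax(axis=1)           # segment id per sentence
```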
https://arxiv.org/abs/2505.16965
Large Language Models (LLMs) are increasingly equipped with real-time web search capabilities and integrated with protocols like the Model Context Protocol (MCP). This extension could introduce new security vulnerabilities. We present a systematic investigation of LLM vulnerabilities to hidden adversarial prompts delivered through malicious font injection in external resources such as webpages, where attackers manipulate the code-to-glyph mapping to inject deceptive content that is invisible to users. We evaluate two critical attack scenarios: (1) "malicious content relay" and (2) "sensitive data leakage" through MCP-enabled tools. Our experiments reveal that indirect prompts with injected malicious fonts can bypass LLM safety mechanisms through external resources, achieving varying success rates depending on data sensitivity and prompt design. Our research underscores the urgent need for enhanced security measures in LLM deployments when processing external content.
https://arxiv.org/abs/2505.16957
Despite their impressive capabilities, Large Language Models struggle with generalisation beyond their training distribution, often exhibiting sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). In this work, we approach this limitation through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. Using IB theory, we prove that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then use this result to demonstrate that periodic global transformation of the internal sequence-level representations (KV cache) is a necessary computational step for improving Transformer generalisation in reasoning tasks. Based on these theoretical insights, we propose a modification to the Transformer architecture, in the form of an additional module that globally rewrites the KV cache at periodic intervals, shifting its capacity away from memorising input prefixes and toward encoding features most useful for predicting future tokens. Our model delivers substantial gains on mathematical reasoning benchmarks, outperforming vanilla Transformers with up to 3.5x more parameters as well as heuristic-driven pruning mechanisms for cache compression. Our approach can be seen as a principled generalisation of existing KV-cache compression methods; whereas such methods focus solely on compressing input representations, they often do so at the expense of retaining predictive information, and thus their capabilities are inherently bounded by those of an unconstrained model. This establishes a principled framework to manipulate Transformer memory using information theory, addressing fundamental reasoning limitations that scaling alone cannot overcome.
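A structural sketch of a periodic global KV-cache rewrite module of the kind the abstract describes. The dimensions, period, and the attention-based mixer are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class KVRewriter(nn.Module):
    """Periodically re-encode the whole KV cache (sketch).

    Every `period` steps, a global mixer rewrites keys and values so cache
    capacity shifts from verbatim prefix memory toward features aggregated
    across the full sequence. Assumes d is divisible by num_heads.
    """

    def __init__(self, d, period=128, num_heads=4):
        super().__init__()
        self.period = period
        self.mix = nn.MultiheadAttention(d, num_heads=num_heads, batch_first=True)

    def maybe_rewrite(self, step, k, v):
        # k, v: (batch, seq, d). Identity on off-period steps.
        if step % self.period != 0:
            return k, v
        k2, _ = self.mix(k, k, k)  # global self-attention over the cached keys
        v2, _ = self.mix(v, v, v)  # ...and over the cached values
        return k2, v2
```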
https://arxiv.org/abs/2505.16950
Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at this https URL.
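The ALO-ASR metric follows directly from its name: a prompt counts as broken if at least one attack in the suite succeeds on it. A minimal sketch of that computation, with an assumed results layout:

```python
def alo_asr(results):
    """At Least One Attack Success Rate (sketch from the abstract's definition).

    results[prompt][attack] = True if that attack forced a harmful completion.
    A prompt is counted as broken if *any* attack succeeds, so the metric
    captures worst-case vulnerability across the whole attack suite.
    """
    broken = sum(1 for attacks in results.values() if any(attacks.values()))
    return broken / max(len(results), 1)
```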
https://arxiv.org/abs/2505.16947
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLMs' instruction-following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, yielding insights into the failure modes of existing LLMs. We have released the code and data to facilitate future research.
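A sketch of the hybrid code-/LLM-based constraint checking the abstract mentions, under our reading: mechanically checkable constraints ship a Python predicate, while semantic constraints fall back to an LLM judge. The constraint schema and judge interface are assumptions.

```python
def evaluate_instruction(response, constraints, llm_judge):
    """Score a response against a list of annotated constraints (sketch).

    Each constraint either carries a "checker" predicate (code-based
    evaluation) or a "description" for the LLM judge (LLM-based evaluation);
    combining both per instruction gives the hybrid code-LLM evaluation.
    """
    results = []
    for c in constraints:
        if "checker" in c:
            results.append(bool(c["checker"](response)))            # code-based
        else:
            results.append(bool(llm_judge(c["description"], response)))  # LLM-based
    return sum(results) / max(len(results), 1)  # fraction of constraints satisfied
```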
https://arxiv.org/abs/2505.16944
Foundation models hold significant promise in healthcare, given their capacity to extract meaningful representations independent of downstream tasks. This property has enabled state-of-the-art performance across several clinical applications trained on structured electronic health record (EHR) data, even in settings with limited labeled data, a prevalent challenge in healthcare. However, there is little consensus on these models' potential for clinical utility, owing to the lack of comprehensive and meaningful tasks and of sufficiently diverse evaluations to characterize their benefit over conventional supervised learning. To address this gap, we propose a suite of clinically meaningful tasks spanning patient outcomes and early prediction of acute and chronic conditions, together with desiderata for robust evaluation. We evaluate state-of-the-art foundation models on EHR data from 5 million patients at Columbia University Irving Medical Center (CUMC), a large urban academic medical center in New York City, across 14 clinically relevant tasks. We measure overall accuracy, calibration, and subpopulation performance to surface tradeoffs based on the choice of pre-training, tokenization, and data representation strategies. Our study aims to advance the empirical evaluation of structured EHR foundation models and to guide the development of future healthcare foundation models.
https://arxiv.org/abs/2505.16941
Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making the method stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
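For concreteness, the classical Newton-Schulz baseline the abstract contrasts against, as a matmul-only sketch. Polar Express keeps this structure but replaces the fixed (3/2, -1/2) coefficients with per-iteration minimax-optimal ones, which are not reproduced here.

```python
import torch

def polar_newton_schulz(G, steps=10):
    """Newton-Schulz iteration for the polar factor U of G = U P (sketch).

    Uses only matrix-matrix multiplications, hence GPU-friendly. Slow initial
    convergence is the weakness Polar Express addresses with adaptive
    coefficients.
    """
    X = G / G.norm()  # Frobenius scaling puts all singular values in (0, 1]
    for _ in range(steps):
        # X_{k+1} = (3 X_k - X_k X_k^T X_k) / 2 drives singular values to 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X  # approximates the orthogonal polar factor U
```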
https://arxiv.org/abs/2505.16932