The advent of Large Multimodal Models (LMMs) has significantly enhanced the ability of Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratic computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism, substantially reducing the visual token count with minimal performance degradation. Specifically, we first implement a significant token reduction on the output of pretrained visual encoders through a pooling methodology. Then, within the LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching their visual comprehension. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
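A minimal PyTorch sketch of the visual-to-visual cross-attention idea, with pooled tokens as queries over the original token set; the dimensions, 16x pooling factor, and module layout are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToVisualCrossAttention(nn.Module):
    """Pooled visual tokens (queries) attend over the full original token set."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pooled: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
        # pooled: (B, M, D) with M << N; original: (B, N, D)
        out, _ = self.attn(query=pooled, key=original, value=original)
        return self.norm(pooled + out)  # residual keeps the compressed stream stable

B, N, D = 2, 1024, 768
original = torch.randn(B, N, D)                       # tokens from the visual encoder
pooled = F.adaptive_avg_pool1d(original.transpose(1, 2), N // 16).transpose(1, 2)
compressed = VisualToVisualCrossAttention(D)(pooled, original)
print(compressed.shape)                               # torch.Size([2, 64, 768])
```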
https://arxiv.org/abs/2505.17020
Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich and multi-level textual representations; (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance compared to 15+ MLLMs on the English image implication benchmark and a substantial improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
https://arxiv.org/abs/2505.17019
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final answer. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed from a comparison of the thinking rewards of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at this https URL.
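A toy sketch of how a trustworthiness weight and an annealing schedule might combine outcome and thinking rewards inside one GRPO rollout group; the `trust` formula and linear schedule here are illustrative guesses, not the paper's exact definitions:

```python
import numpy as np

def trust_weighted_advantages(correct, think_reward, step, total_steps):
    """Combine outcome and (weighted, annealed) thinking rewards for one
    group of rollouts, then normalize GRPO-style. Illustrative only."""
    correct = np.asarray(correct, dtype=float)          # rule-based outcome reward
    think_reward = np.asarray(think_reward, dtype=float)
    pos, neg = correct == 1, correct == 0
    gap = (think_reward[pos].mean() if pos.any() else 0.0) \
        - (think_reward[neg].mean() if neg.any() else 0.0)
    trust = float(np.clip(gap, 0.0, 1.0))   # low trust if wrong answers score high
    anneal = 1.0 - step / total_steps       # thinking reward fades over training
    total = correct + anneal * trust * think_reward
    return (total - total.mean()) / (total.std() + 1e-8)

print(trust_weighted_advantages([1, 0, 1, 0], [0.8, 0.7, 0.9, 0.2],
                                step=100, total_steps=1000))
```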
https://arxiv.org/abs/2505.17018
Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at this https URL
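For concreteness, the group-relative normalization at the heart of GRPO, applied to reward-model scores of several images sampled per prompt; this is the standard formulation, not code from the paper:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_prompts, G) reward-model scores for G images per prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # group-relative advantage per image

# Four images sampled for one prompt, scored by some reward model:
print(grpo_advantages(torch.tensor([[0.7, 0.4, 0.9, 0.5]])))
```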
https://arxiv.org/abs/2505.17017
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
https://arxiv.org/abs/2505.17015
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from 11 other existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
https://arxiv.org/abs/2505.17012
Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging both internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at this https URL.
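One plausible form of the internal-knowledge reward described above, sketched under the assumption that correct answers produced without any search calls earn a small bonus; `internal_bonus` is a hypothetical parameter, not the paper's:

```python
def r1_searcher_reward(answer_correct: bool, num_search_calls: int,
                       internal_bonus: float = 0.2) -> float:
    """Outcome-supervised reward with a bonus for internal-knowledge use
    (answering correctly without invoking the search engine)."""
    reward = 1.0 if answer_correct else 0.0
    if answer_correct and num_search_calls == 0:
        reward += internal_bonus
    return reward

print(r1_searcher_reward(True, 0))   # 1.2: correct, purely from internal knowledge
print(r1_searcher_reward(True, 3))   # 1.0: correct, but needed retrieval
```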
https://arxiv.org/abs/2505.17005
Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data in PoT (Program-of-Thought) format achieves the best generalization performance across other languages. Additionally, we curate formal-language-related training data to further enhance small language models, and the experimental results indicate that a simple rejection fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at this https URL.
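A sketch of rejection fine-tuning as described: sample several candidate formal-language solutions per problem and keep only the verified ones as SFT data. Here `sample_fn` and `verify_fn` are hypothetical stand-ins for the model and, e.g., a PoT program executor:

```python
def build_rejection_ft_dataset(problems, sample_fn, verify_fn, k=8):
    """Sample k candidates per problem; keep only those that verify."""
    kept = []
    for prob in problems:
        for _ in range(k):
            cand = sample_fn(prob)            # e.g., a PoT program from the LLM
            if verify_fn(prob, cand):         # e.g., execute and check the answer
                kept.append({"prompt": prob, "completion": cand})
    return kept                               # SFT data for the small model
```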
https://arxiv.org/abs/2505.16998
Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each subtask is trained via SFT and subsequently enhanced by DPO to align with psychological preferences. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.
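For reference, the standard DPO objective that each subtask's SFT model would be further trained with, given IPM-mined preference pairs (the numbers below are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO on sequence log-probabilities of a preferred/dispreferred pair."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Illustrative log-probabilities for one IPM-mined preference pair:
print(dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
               torch.tensor([-13.0]), torch.tensor([-14.2])))
```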
https://arxiv.org/abs/2505.16995
In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations Dimple needs can drop to as little as $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique from autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at this https URL.
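A minimal sketch of one confident-decoding step under the stated idea: commit every masked position whose top-1 probability clears a threshold, falling back to the single most confident position so each step generates at least one token. The threshold value and fallback rule are assumptions:

```python
import torch

@torch.no_grad()
def confident_decoding_step(logits, masked, threshold=0.9):
    """logits: (1, L, V); masked: (1, L) bool for still-masked positions.
    Returns predicted tokens and which positions to commit this step."""
    probs = logits.softmax(dim=-1)
    conf, tokens = probs.max(dim=-1)                 # (1, L)
    accept = masked & (conf >= threshold)            # commit all confident slots
    if not accept.any():                             # guarantee forward progress
        pos = torch.where(masked[0])[0]
        accept[0, pos[conf[0, pos].argmax()]] = True
    return tokens, accept
```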
https://arxiv.org/abs/2505.16990
Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single-domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.
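A small sketch of the reuse-vs-recompute decision an integrated cache enables; the TTL-based staleness rule and the `get_weather` tool are illustrative assumptions, not T1's actual mechanism:

```python
import time

class ToolCallCache:
    """Memoize tool results; reuse while fresh, recompute when stale."""
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.store = {}

    def call(self, tool, **kwargs):
        key = (tool.__name__, tuple(sorted(kwargs.items())))
        hit = self.store.get(key)
        if hit and time.time() - hit["t"] < self.ttl:
            return hit["result"]                     # reuse cached result
        result = tool(**kwargs)                      # recompute
        self.store[key] = {"result": result, "t": time.time()}
        return result

def get_weather(city: str) -> str:                   # hypothetical tool
    return f"sunny in {city}"

cache = ToolCallCache(ttl_seconds=60)
print(cache.call(get_weather, city="Oslo"))          # computed
print(cache.call(get_weather, city="Oslo"))          # served from cache
```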
https://arxiv.org/abs/2505.16986
Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model size. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.
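One way to picture a unified objective is a supervision weight annealed toward the RL signal over training; this blended form is an assumed simplification for intuition, not necessarily UFT's exact formulation:

```python
def uft_loss(sft_nll: float, rl_loss: float, step: int, total_steps: int) -> float:
    """Anneal from supervision-heavy to RL-heavy training (assumed form)."""
    alpha = max(0.0, 1.0 - step / total_steps)   # supervision weight decays
    return alpha * sft_nll + (1.0 - alpha) * rl_loss

print(uft_loss(sft_nll=2.1, rl_loss=0.6, step=200, total_steps=1000))  # 1.8
```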
https://arxiv.org/abs/2505.16984
Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or on specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at this https URL.
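A toy sketch of group position encoding as described: position IDs restart within each group (chunk), so relative positions inside source and target contexts are preserved even as the stream grows. The chunking below is an illustrative assumption:

```python
def group_position_ids(chunk_lengths):
    """Position IDs restart inside each group, preserving relative order
    within chunks instead of one absolute order across the stream."""
    pos_ids = []
    for length in chunk_lengths:
        pos_ids.extend(range(length))
    return pos_ids

# Three incoming source chunks of 4, 3, and 5 tokens:
print(group_position_ids([4, 3, 5]))
# -> [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4]
```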
https://arxiv.org/abs/2505.16983
Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge graphs (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, and enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.
https://arxiv.org/abs/2505.16982
Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from sets of a small number of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduce a novel dataset comprising 540 structured grammar generation challenges, devise 6 metrics, and evaluate 8 diverse LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, namely HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.
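A toy instance of the task (hypothetical, not drawn from the paper's dataset): infer a BNF grammar for balanced parentheses from a handful of positive and negative strings, with correctness judged both syntactically (valid BNF) and semantically (accepts positives, rejects negatives):

```python
def accepts(s: str) -> bool:
    """Reference predicate for the toy target language: balanced parentheses."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

positives = ["()", "(())", "()()"]
negatives = ["(", ")(", "(()"]

# One grammar (in BNF) a model could be expected to produce:
inferred_bnf = '<s> ::= "" | "(" <s> ")" <s>'

assert all(accepts(p) for p in positives)
assert not any(accepts(n) for n in negatives)
```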
https://arxiv.org/abs/2505.16978
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45\% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model to perform comparably to GPT-4o on the \textit{hard} split, underscoring the value of its high-quality training data. Code is available here \href{this https URL}{this https URL}.
https://arxiv.org/abs/2505.16975
We introduce \texttt{CASS}, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA~$\leftrightarrow$~HIP) and assembly-level (Nvidia SASS~$\leftrightarrow$~AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the \texttt{CASS} family of domain-specific language models, achieving 95\% source translation accuracy and 37.5\% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85\% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce \texttt{CASS-Bench}, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. Dataset and benchmark are on \href{this https URL}{\textcolor{blue}{HuggingFace}}, with code at \href{this https URL}{\textcolor{blue}{GitHub}}.
https://arxiv.org/abs/2505.16968
Large Language Models (LLMs) are increasingly equipped with capabilities of real-time web search and integrated with protocols like the Model Context Protocol (MCP). This extension could introduce new security vulnerabilities. We present a systematic investigation of LLM vulnerabilities to hidden adversarial prompts through malicious font injection in external resources like webpages, where attackers manipulate the code-to-glyph mapping to inject deceptive content that is invisible to users. We evaluate two critical attack scenarios: (1) "malicious content relay" and (2) "sensitive data leakage" through MCP-enabled tools. Our experiments reveal that indirect prompts with injected malicious fonts can bypass LLM safety mechanisms through external resources, achieving varying success rates depending on data sensitivity and prompt design. Our research underscores the urgent need for enhanced security measures in LLM deployments when processing external content.
https://arxiv.org/abs/2505.16957
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extreme compression of multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.
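Of the techniques combined here, vocabulary trimming is the most mechanical; a minimal sketch (with a deliberately small embedding dimension for the demo, and `keep_ids` standing in for the token IDs observed in the target-language corpus):

```python
import torch
import torch.nn as nn

def trim_vocabulary(embedding: nn.Embedding, keep_ids) -> nn.Embedding:
    """Keep only rows observed in the target-language corpus; the embedding
    matrix usually dominates a multilingual encoder's parameter count."""
    keep = torch.tensor(sorted(set(keep_ids)))
    trimmed = nn.Embedding(len(keep), embedding.embedding_dim)
    trimmed.weight.data.copy_(embedding.weight.data[keep])
    return trimmed  # token IDs in the tokenizer must be remapped to match

full = nn.Embedding(250_000, 32)                 # multilingual vocab (tiny dim for demo)
small = trim_vocabulary(full, keep_ids=range(30_000))
print(small.weight.shape)                        # torch.Size([30000, 32])
```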
https://arxiv.org/abs/2505.16956
Despite their impressive capabilities, Large Language Models struggle with generalisation beyond their training distribution, often exhibiting sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). In this work, we approach this limitation through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. We prove using IB theory that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then use this result to demonstrate that periodic global transformation of the internal sequence-level representations (KV cache) is a necessary computational step for improving Transformer generalisation in reasoning tasks. Based on these theoretical insights, we propose a modification to the Transformer architecture, in the form of an additional module that globally rewrites the KV cache at periodic intervals, shifting its capacity away from memorising input prefixes and toward encoding features most useful for predicting future tokens. Our model delivers substantial gains on mathematical reasoning benchmarks, outperforming both vanilla Transformers with up to 3.5x more parameters, as well as heuristic-driven pruning mechanisms for cache compression. Our approach can be seen as a principled generalisation of existing KV-cache compression methods; whereas such methods focus solely on compressing input representations, they often do so at the expense of retaining predictive information, and thus their capabilities are inherently bounded by those of an unconstrained model. This establishes a principled framework to manipulate Transformer memory using information theory, addressing fundamental reasoning limitations that scaling alone cannot overcome.
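A rough sketch of the proposed module's shape: every `period` steps, a learned sequence-level transform rewrites the cached keys and values globally. Sharing one transform for keys and values, and the period itself, are assumptions for illustration, not the paper's specification:

```python
import torch
import torch.nn as nn

class PeriodicKVRewriter(nn.Module):
    """Every `period` steps, globally rewrite one layer's cached keys/values
    with a learned sequence-level transform (assumed form)."""
    def __init__(self, dim: int, period: int = 128):
        super().__init__()
        self.period = period
        self.rewrite = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def maybe_rewrite(self, step: int, keys: torch.Tensor, values: torch.Tensor):
        # keys, values: (B, T, D) for a single attention layer
        if step > 0 and step % self.period == 0:
            keys, values = self.rewrite(keys), self.rewrite(values)
        return keys, values

rw = PeriodicKVRewriter(dim=64, period=4)
k = v = torch.randn(1, 16, 64)
k2, v2 = rw.maybe_rewrite(step=4, keys=k, values=v)
print(k2.shape, v2.shape)   # torch.Size([1, 16, 64]) each
```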
https://arxiv.org/abs/2505.16950