Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at this https URL.
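The dual-stage, multi-dimensional reward described above can be pictured as fusing per-dimension scores from an MLLM judge for both the reasoning chain and the final image. Below is a minimal sketch of such a fusion; the dimensions, weights, and function names are illustrative assumptions, not the released GoT-R1 reward.

```python
# Hedged sketch: fusing multi-dimensional MLLM-judge scores into one scalar reward.
# The dimensions, weights, and names below are illustrative assumptions, not the
# authors' released implementation.
from dataclasses import dataclass

@dataclass
class JudgeScores:
    semantic: float   # prompt-image semantic alignment, in [0, 1]
    spatial: float    # spatial-relationship correctness, in [0, 1]
    quality: float    # visual quality / plausibility, in [0, 1]

def fuse_rewards(reasoning: JudgeScores, image: JudgeScores,
                 w_reasoning: float = 0.5, w_image: float = 0.5) -> float:
    """Stage 1 scores the reasoning (layout plan); stage 2 scores the rendered image."""
    def mean(s: JudgeScores) -> float:
        return (s.semantic + s.spatial + s.quality) / 3.0
    return w_reasoning * mean(reasoning) + w_image * mean(image)

# Example: a well-planned layout but a mediocre final render.
r = fuse_rewards(JudgeScores(0.9, 0.8, 0.7), JudgeScores(0.6, 0.5, 0.7))
print(round(r, 3))
```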
https://arxiv.org/abs/2505.17022
The advent of Large Multimodal Models (LMMs) has significantly enhanced the ability of Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically growing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism, substantially reducing the visual token count with minimal performance degradation. Specifically, we first apply a significant token reduction to the outputs of pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching their visual comprehension. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
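A minimal PyTorch sketch of the two cross-attention paths described above: pooled visual tokens querying the original visual token set, and text tokens querying the original visual tokens. The module layout, pooling choice, and dimensions are assumptions for illustration, not the CrossLMM reference implementation.

```python
# Hedged sketch of the dual cross-attention described above, using
# torch.nn.MultiheadAttention. Shapes and module layout are assumptions.
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8, pooled_len: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(pooled_len)          # token reduction
        self.v2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (B, N_vis, D) original visual tokens; text: (B, N_txt, D)
        pooled = self.pool(visual.transpose(1, 2)).transpose(1, 2)  # (B, 64, D)
        # Visual-to-visual: pooled tokens query the full-resolution token set.
        pooled, _ = self.v2v(query=pooled, key=visual, value=visual)
        # Text-to-visual: text tokens are enriched by the original visual tokens.
        text, _ = self.t2v(query=text, key=visual, value=visual)
        return pooled, text

module = DualCrossAttention()
vis = torch.randn(2, 1024, 1024)   # e.g. many frame tokens
txt = torch.randn(2, 32, 1024)
pooled_vis, enriched_txt = module(vis, txt)
print(pooled_vis.shape, enriched_txt.shape)  # (2, 64, 1024) (2, 32, 1024)
```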
https://arxiv.org/abs/2505.17020
Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they struggle with a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich and multi-level textual representations, (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity, and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance against 15+ MLLMs on the English image implication benchmark and a large improvement on the Chinese benchmark, performing comparably to the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
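The three-stage flow can be sketched as a simple pipeline; `call_llm` and `search_knowledge` below are hypothetical placeholders for the model and retrieval backends, and the prompts are illustrative only.

```python
# Hedged sketch of the Perception -> Search -> Reasoning flow described above.
# Image handling is elided: the image enters as a textual stand-in.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an MLLM/LLM client here")

def search_knowledge(query: str) -> str:
    raise NotImplementedError("plug in a retrieval backend here")

def let_androids_dream(image_description: str, max_rounds: int = 3) -> str:
    # 1) Perception: turn the image into multi-level textual descriptions.
    perception = call_llm(
        f"Describe the surface content, style, and symbols of: {image_description}")
    # 2) Search: iteratively pull in cross-domain knowledge to resolve ambiguity.
    context = perception
    for _ in range(max_rounds):
        gap = call_llm(
            f"What cultural or contextual knowledge is missing to interpret this?\n{context}")
        if "none" in gap.lower():
            break
        context += "\n" + search_knowledge(gap)
    # 3) Reasoning: produce the context-aligned implication explicitly.
    return call_llm(
        f"Given the context below, explain the image's implied meaning step by step.\n{context}")
```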
https://arxiv.org/abs/2505.17019
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final answer. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at this https URL.
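A rough sketch of how a trustworthiness weight and an annealed thinking reward could be combined, following the description above; the exact weighting rule and schedule are assumptions rather than the released Trust-GRPO code.

```python
# Hedged sketch of a trustworthiness weight plus annealed thinking reward.
import numpy as np

def trust_weight(think_rewards: np.ndarray, is_correct: np.ndarray) -> float:
    """Higher trust when responses with correct answers also get higher thinking rewards."""
    correct = is_correct.astype(bool)
    pos, neg = think_rewards[correct], think_rewards[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return 1.0  # nothing to compare against within this group; keep full trust
    gap = pos.mean() - neg.mean()   # > 0 means the thinking reward agrees with outcomes
    return float(np.clip(0.5 + gap, 0.0, 1.0))

def total_reward(outcome_r: float, thinking_r: float, trust: float,
                 step: int, total_steps: int) -> float:
    anneal = 1.0 - step / total_steps       # thinking reward fades over training
    return outcome_r + anneal * trust * thinking_r

group_think = np.array([0.8, 0.6, 0.3, 0.2])
group_correct = np.array([1, 1, 0, 0])
w = trust_weight(group_think, group_correct)
print(w, total_reward(1.0, 0.8, w, step=100, total_steps=1000))
```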
https://arxiv.org/abs/2505.17018
Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at this https URL
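For reference, the group-relative advantage at the heart of GRPO-style training can be written in a few lines; this is a generic illustration applied to reward-scored image rollouts, not the paper's training code.

```python
# Hedged sketch of the group-relative advantage used by GRPO-style training,
# applied to a group of image-generation rollouts scored by a reward model.
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within a group of samples drawn for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four images sampled for one prompt, scored by a text-image reward model.
rewards = np.array([0.82, 0.55, 0.61, 0.90])
print(grpo_advantages(rewards))  # positive for above-group-average samples
```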
https://arxiv.org/abs/2505.17017
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
https://arxiv.org/abs/2505.17015
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from the other 11 existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
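A toy sketch of a ReAct-style tool loop of the kind such a spatial agent could run; the tool names, controller prompt, and parsing are illustrative assumptions, not SpatialAgent's actual nine tools.

```python
# Hedged sketch of a ReAct-style loop for a tool-using spatial agent.
from typing import Callable, Dict

def react_loop(question: str, llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trace + "Thought / Action(tool: input) / or Final: ...")
        trace += step + "\n"
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        if step.startswith("Action(") and step.endswith(")"):
            tool_name, _, tool_input = step[len("Action("):-1].partition(":")
            result = tools.get(tool_name.strip(), lambda x: "unknown tool")(tool_input.strip())
            trace += f"Observation: {result}\n"
    return "no answer within budget"

# Toy demo with a scripted "LLM" and one fake depth tool.
scripted = iter(["Action(depth: chair vs table)", "Final: the chair is closer"])
answer = react_loop("Which object is closer to the camera?",
                    llm=lambda _: next(scripted),
                    tools={"depth": lambda q: "chair depth 1.2m, table depth 2.5m"})
print(answer)  # the chair is closer
```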
https://arxiv.org/abs/2505.17012
Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
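A soft prefix is simply a block of trainable real-valued vectors prepended to the token embeddings, outside the discrete vocabulary; the sketch below shows the mechanics under assumed shapes, with the backbone weights treated as frozen.

```python
# Hedged sketch of a soft prefix: trainable real-valued vectors prepended to
# embeddings of hard tokens. Shapes are assumptions for illustration.
import torch
import torch.nn as nn

class SoftPrefix(nn.Module):
    def __init__(self, prefix_len: int, dim: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (B, T, D) from the frozen embedding table.
        batch = token_embeds.shape[0]
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)  # (B, P+T, D)

embeds = torch.randn(4, 16, 512)          # embeddings of hard tokens
soft = SoftPrefix(prefix_len=8, dim=512)
print(soft(embeds).shape)                 # torch.Size([4, 24, 512])
# Only `soft.prefix` would be trained; the network weights stay frozen.
```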
https://arxiv.org/abs/2505.17010
Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at this https URL.
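One way to picture the internal-knowledge reward is a small bonus for correct answers produced without any search calls; the bonus form and magnitude below are assumptions, not the released reward.

```python
# Hedged sketch of an outcome-supervised reward with a bonus for answering
# from internal knowledge alone (no retrieval calls).
def r1_searcher_reward(answer_correct: bool, num_search_calls: int,
                       internal_bonus: float = 0.2) -> float:
    reward = 1.0 if answer_correct else 0.0
    if answer_correct and num_search_calls == 0:
        reward += internal_bonus          # solved with internal knowledge only
    return reward

print(r1_searcher_reward(True, 0), r1_searcher_reward(True, 2), r1_searcher_reward(False, 1))
```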
https://arxiv.org/abs/2505.17005
Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data in PoT format achieves the best generalization performance across other languages. Additionally, we curate training data related to formal languages to further enhance small language models, and the experimental results indicate that a simple rejection fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at this https URL.
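Rejection fine-tuning can be sketched as sampling several formal-language solutions per problem and keeping only those that verify against the gold answer; `generate` and `verify` below are hypothetical placeholders.

```python
# Hedged sketch of rejection fine-tuning data construction: sample candidate
# formal-language solutions, keep only verified ones, fine-tune on survivors.
from typing import Callable, List, Tuple

def build_rft_dataset(problems: List[dict],
                      generate: Callable[[str, int], List[str]],
                      verify: Callable[[str, str], bool],
                      samples_per_problem: int = 8) -> List[Tuple[str, str]]:
    kept = []
    for p in problems:
        for candidate in generate(p["question"], samples_per_problem):
            if verify(candidate, p["answer"]):   # reject trajectories that fail verification
                kept.append((p["question"], candidate))
    return kept
```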
https://arxiv.org/abs/2505.16998
LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
https://arxiv.org/abs/2505.16997
Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each was trained via SFT and subsequently enhanced by DPO to align with the psychological preference. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.
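For context, the pairwise DPO objective used to align each subtask with psychological preferences has a compact form; the sketch below is the standard DPO loss, with response log-probabilities assumed to be summed over tokens.

```python
# Hedged sketch of the standard pairwise DPO objective (not the paper's exact code).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy's chosen-vs-rejected margin above the reference model's margin."""
    margin = (policy_chosen_logp - policy_rejected_logp) - (ref_chosen_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```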
https://arxiv.org/abs/2505.16995
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules to yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs are limited by significant resource costs and suboptimal joint optimization. To address these issues, we propose \name, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of \name simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of \name, showing relative improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Code available at this https URL.
https://arxiv.org/abs/2505.16994
In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple can be as low as $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at this https URL.
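A rough sketch of one confident-decoding step for a discrete-diffusion decoder: commit every still-masked position whose top-1 probability clears a threshold, with a fallback that always commits at least one token. The threshold and commit rule are illustrative assumptions, not Dimple's exact schedule.

```python
# Hedged sketch of a single confident-decoding step.
import torch

def confident_decode_step(logits: torch.Tensor, tokens: torch.Tensor,
                          mask_id: int, threshold: float = 0.9) -> torch.Tensor:
    # logits: (T, V) for the full sequence; tokens: (T,) with mask_id at undecided slots.
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    undecided = tokens == mask_id
    commit = undecided & (conf >= threshold)
    if undecided.any() and not commit.any():
        # Always make progress: commit the single most confident undecided position.
        idx = torch.where(undecided)[0][conf[undecided].argmax()]
        commit[idx] = True
    tokens = tokens.clone()
    tokens[commit] = pred[commit]
    return tokens

vocab_size, seq_len = 100, 8
MASK = vocab_size                       # sentinel id outside the real vocabulary
toks = torch.full((seq_len,), MASK)
toks = confident_decode_step(torch.randn(seq_len, vocab_size), toks, MASK)
print(toks)                             # at least one position committed per call
```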
https://arxiv.org/abs/2505.16990
LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.
https://arxiv.org/abs/2505.16988
Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single-domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.
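The caching idea can be sketched as memoizing tool calls by (tool, arguments) with a staleness budget that governs reuse versus recomputation; the policy below is an illustrative assumption, not T1's released mechanism.

```python
# Hedged sketch of a tool-call cache with a reuse/recompute decision.
import time
from typing import Any, Callable, Dict, Tuple

class ToolCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def call(self, tool_name: str, args: str, tool: Callable[[str], Any],
             force_recompute: bool = False) -> Any:
        key = (tool_name, args)
        if not force_recompute and key in self.store:
            ts, value = self.store[key]
            if time.time() - ts < self.ttl:
                return value                      # reuse cached result
        value = tool(args)                        # recompute and refresh cache
        self.store[key] = (time.time(), value)
        return value

cache = ToolCache(ttl_seconds=60)
print(cache.call("weather", "Paris", lambda city: f"sunny in {city}"))
print(cache.call("weather", "Paris", lambda city: f"rainy in {city}"))  # served from cache
```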
https://arxiv.org/abs/2505.16986
Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.
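One way to unify the two signals in a single update, in the spirit described above, is to mix a policy-gradient term on sampled solutions with a supervised NLL term on reference solutions; the mixing coefficient and schedule below are assumptions, not the paper's exact objective.

```python
# Hedged sketch of a unified SFT + RFT objective in one policy update.
import torch

def unified_loss(sample_logp: torch.Tensor, sample_reward: torch.Tensor,
                 hint_logp: torch.Tensor, alpha: float) -> torch.Tensor:
    """alpha in [0, 1]: 1.0 = pure SFT on reference hints, 0.0 = pure reward-driven RFT."""
    baseline = sample_reward.mean()
    rl_term = -((sample_reward - baseline) * sample_logp).mean()   # REINFORCE with baseline
    sft_term = -hint_logp.mean()                                   # NLL of reference solutions
    return alpha * sft_term + (1.0 - alpha) * rl_term

loss = unified_loss(sample_logp=torch.tensor([-20.0, -25.0], requires_grad=True),
                    sample_reward=torch.tensor([1.0, 0.0]),
                    hint_logp=torch.tensor([-18.0], requires_grad=True),
                    alpha=0.5)
loss.backward()
print(loss.item())
```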
https://arxiv.org/abs/2505.16984
Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches, requires no architectural modifications, and exhibits strong generalization in both streaming and batch modes. The code is available at this https URL.
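A group position-ID scheme can be sketched by giving each segment its own base so positions are consistent within a group regardless of how much earlier content has streamed in; the grouping rule below is an assumption for illustration.

```python
# Hedged sketch of group position IDs: each group restarts from its own base,
# preserving relative order within source and target contexts.
from typing import List

def group_position_ids(group_lengths: List[int], group_bases: List[int]) -> List[int]:
    """group_lengths[i] tokens in group i receive positions base_i, base_i + 1, ..."""
    ids: List[int] = []
    for length, base in zip(group_lengths, group_bases):
        ids.extend(range(base, base + length))
    return ids

# Source tokens (streamed chunks) share one group starting at 0; the target
# group uses its own base, so its internal relative positions stay stable
# no matter how much additional source arrives.
print(group_position_ids([8, 3], [0, 1000]))
# [0, 1, 2, 3, 4, 5, 6, 7, 1000, 1001, 1002]
```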
https://arxiv.org/abs/2505.16983
Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.
https://arxiv.org/abs/2505.16982
Single-agent LLMs hit hard limits: finite context, role overload, and brittle domain transfer. Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tune, self-check). Grounded in the No-Free-Lunch theorem, KtR trades the chase for a universal prompt for disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators, with no ever-larger monoliths required.
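A toy rendering of a controller-mediated, typed decomposition on the 0/1 knapsack example: one "agent" enumerates candidate subsets, one checks capacity, one optimizes value. The split and the stubbed agents are illustrative assumptions, not the KtR blueprint itself.

```python
# Hedged sketch of a typed, controller-mediated decomposition for 0/1 knapsack.
from itertools import combinations
from typing import List, Tuple

Item = Tuple[int, int]  # (weight, value)

def propose_agent(items: List[Item]) -> List[List[int]]:
    # Subtask type "enumerate": list candidate index subsets (small n only).
    idx = range(len(items))
    return [list(c) for r in range(len(items) + 1) for c in combinations(idx, r)]

def verify_agent(items: List[Item], subset: List[int], capacity: int) -> bool:
    # Subtask type "check": does the subset respect the capacity constraint?
    return sum(items[i][0] for i in subset) <= capacity

def select_agent(items: List[Item], feasible: List[List[int]]) -> List[int]:
    # Subtask type "optimize": pick the feasible subset with the highest value.
    return max(feasible, key=lambda s: sum(items[i][1] for i in s))

def controller(items: List[Item], capacity: int) -> List[int]:
    candidates = propose_agent(items)
    feasible = [s for s in candidates if verify_agent(items, s, capacity)]
    return select_agent(items, feasible)

print(controller([(2, 3), (3, 4), (4, 8), (5, 10)], capacity=7))  # [0, 3]: weight 7, value 13
```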
https://arxiv.org/abs/2505.16979