Visual generation models have made remarkable progress in creating realistic images from text prompts, yet they struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and the final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state of the art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at this https URL.
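The abstract does not spell out how the dual-stage, multi-dimensional reward is composed. As a rough, hedged illustration of the idea (an MLLM judge scoring both the reasoning chain and the rendered image along several dimensions), the sketch below uses hypothetical names: `score_with_mllm`, the weights, and the 0.5/0.5 process/outcome split are assumptions, not the paper's actual design.

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    semantic: float = 1.0   # prompt-image semantic alignment
    spatial: float = 1.0    # layout / spatial-relationship accuracy
    quality: float = 0.5    # overall visual quality

def score_with_mllm(question: str, image, reasoning: str) -> float:
    """Hypothetical helper: ask an MLLM judge to return a score in [0, 1].
    In practice this would call a multimodal LLM with a rubric prompt; it is
    stubbed out here because the paper's exact prompts are not given."""
    raise NotImplementedError

def got_r1_style_reward(prompt: str, reasoning: str, image,
                        w: RewardWeights = RewardWeights()) -> float:
    # Stage 1: judge the reasoning chain itself (semantic plan and layout plan).
    r_reason = score_with_mllm(
        f"Does this reasoning correctly plan objects and layout for: {prompt}?",
        image=None, reasoning=reasoning)

    # Stage 2: judge the rendered image along several dimensions.
    r_semantic = score_with_mllm(f"Does the image contain everything in: {prompt}?", image, "")
    r_spatial  = score_with_mllm(f"Are spatial relations as specified by: {prompt}?", image, "")
    r_quality  = score_with_mllm("Rate the overall visual quality of the image.", image, "")

    image_reward = (w.semantic * r_semantic + w.spatial * r_spatial
                    + w.quality * r_quality) / (w.semantic + w.spatial + w.quality)
    # Combine process and outcome rewards; the equal split is an assumption.
    return 0.5 * r_reason + 0.5 * image_reward
```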
https://arxiv.org/abs/2505.17022
As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB spans 11 diverse domains, including visual reasoning, document understanding, OCR, scientific analysis, and cultural interpretation. It comprises 1,356 multimodal samples paired with 5,119 human-curated reasoning steps and corresponding actions. We evaluated 12 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB offers a structured framework for diagnosing multimodal reasoning in underrepresented languages and marks a critical step toward inclusive, transparent, and culturally aware AI systems. We release the benchmark, rubric, and evaluation suite to support future research and reproducibility. Code available at: this https URL
https://arxiv.org/abs/2505.17021
The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically growing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism, substantially reducing visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
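A minimal sketch of the visual-to-visual idea described above: pooled visual tokens act as queries over the full set of original visual tokens. Dimensions, the pooling factor, and the residual connection are illustrative assumptions, not CrossLMM's exact configuration.

```python
import torch
import torch.nn as nn

class VisualToVisualCrossAttention(nn.Module):
    """Pooled visual tokens query the original visual tokens (sketch)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8, pool: int = 4):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)  # token reduction
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, dim) from the pretrained visual encoder
        pooled = self.pool(visual_tokens.transpose(1, 2)).transpose(1, 2)
        # Pooled tokens attend to the original tokens to recover fine-grained detail.
        out, _ = self.attn(query=pooled, key=visual_tokens, value=visual_tokens)
        return pooled + out  # residual keeps the coarse summary


if __name__ == "__main__":
    x = torch.randn(2, 256, 1024)            # e.g. 256 tokens per frame chunk
    module = VisualToVisualCrossAttention()
    print(module(x).shape)                    # torch.Size([2, 64, 1024])
```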
https://arxiv.org/abs/2505.17020
Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel at basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich and multi-level textual representations; (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. With the lightweight GPT-4o-mini model, our framework achieves SOTA performance against 15+ MLLMs on the English image implication benchmark and a huge improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
https://arxiv.org/abs/2505.17019
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final answer. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on a comparison of the thinking rewards of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at this https URL.
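A hedged sketch of the two mechanisms described above — a trustworthiness weight derived from comparing thinking rewards of correct versus incorrect rollouts, and an annealed thinking-reward coefficient. The gap-over-std weight and the linear annealing schedule below are assumptions for illustration; Trust-GRPO's exact formulas may differ.

```python
import numpy as np

def trust_weight(think_rewards, is_correct, eps=1e-6):
    """If rollouts with correct answers do not receive higher thinking rewards
    than incorrect ones, the thinking reward is likely being hacked and should
    count for less (illustrative formula)."""
    r = np.asarray(think_rewards, dtype=float)
    c = np.asarray(is_correct, dtype=bool)
    if c.all() or (~c).all():
        return 1.0  # no correct/incorrect contrast available within this group
    gap = r[c].mean() - r[~c].mean()
    return float(np.clip(gap / (r.std() + eps), 0.0, 1.0))

def total_reward(outcome_reward, think_reward, trust, step, total_steps, lam0=0.5):
    """Annealing: the thinking-reward coefficient decays to zero so that later
    training relies mainly on the rule-based outcome reward (assumed schedule)."""
    lam = lam0 * (1.0 - step / total_steps)
    return outcome_reward + lam * trust * think_reward

# Example group of 4 rollouts for one question:
print(trust_weight([0.8, 0.7, 0.9, 0.2], [True, True, False, False]))  # ~0.74
```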
https://arxiv.org/abs/2505.17018
Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at this https URL
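For reference, the two objectives being contrasted can be written in their standard forms as below; the paper's autoregressive image-generation variants may add learned reward models and modality-specific terms on top of these.

```latex
% GRPO: group-normalized advantage for the i-th sampled output o_i given prompt p
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\Big(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\Big)\right],
\quad \rho_i = \frac{\pi_\theta(o_i \mid p)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid p)}

% DPO: preference loss over a chosen/rejected pair (o^{+}, o^{-}) for prompt p
\mathcal{L}_{\mathrm{DPO}}(\theta) =
-\log \sigma\!\left(\beta\Big[
\log\frac{\pi_\theta(o^{+} \mid p)}{\pi_{\mathrm{ref}}(o^{+} \mid p)}
-\log\frac{\pi_\theta(o^{-} \mid p)}{\pi_{\mathrm{ref}}(o^{-} \mid p)}
\Big]\right)
```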
https://arxiv.org/abs/2505.17017
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, improving the lightweight QueST model by 21.2% and bringing the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA enables an otherwise unworkable SFT model (4% success rate) to reach a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models with minimal supervision.
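Leave-one-out advantage estimation is a standard critic-free baseline; a minimal sketch of the general technique with sparse binary rewards is shown below (RIPT-VLA's exact estimator may include additional normalization).

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """For a group of K rollouts of the same task, baseline each rollout's
    reward by the mean reward of the *other* K-1 rollouts. With sparse binary
    success rewards this reduces variance without a learned critic."""
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    baseline = (r.sum() - r) / (k - 1)   # mean of the other K-1 rewards
    return r - baseline

# Example: 4 rollouts of one task, two succeed (reward 1) and two fail (reward 0).
print(leave_one_out_advantages([1, 0, 1, 0]))   # [ 0.667 -0.667  0.667 -0.667]
```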
https://arxiv.org/abs/2505.17016
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
https://arxiv.org/abs/2505.17015
Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
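In classifier-free-guidance terms, the two conceptual erasure mechanisms can be sketched as follows. The notation is assumed, and the second line follows the well-known negative-guidance style of erasure fine-tuning; the paper's own formalization may differ.

```latex
% Standard classifier-free guidance for a concept c:
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing)
  + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)

% (i) Likelihood reduction: fine-tune so the erased concept c^{*} is no longer
%     favoured, i.e. push p_\theta(x \mid c^{*}) toward p_\theta(x \mid \varnothing).
% (ii) Guidance interference: suppress or redirect the guidance direction for c^{*}
%      while leaving the rest of the model largely intact, e.g.
\epsilon_\theta(x_t, c^{*}) \;\approx\; \epsilon_{\theta_0}(x_t, \varnothing)
  - \eta\,\big(\epsilon_{\theta_0}(x_t, c^{*}) - \epsilon_{\theta_0}(x_t, \varnothing)\big)
```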
https://arxiv.org/abs/2505.17013
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from the other 11 existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
https://arxiv.org/abs/2505.17012
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
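A toy version of the inference-time allocation problem: given the block-causal scorer's predicted reconstruction quality for a few candidate token counts per block, pick one count per block to maximize total predicted quality under a global budget. The candidate counts, scores, and the PuLP formulation are illustrative assumptions, not AdapTok's exact ILP.

```python
# pip install pulp
import pulp

TOKEN_OPTIONS = [8, 16, 32, 64]              # illustrative candidate counts per block

def allocate_tokens(scores, budget):
    """scores[b][k]: predicted reconstruction quality of block b when it keeps
    TOKEN_OPTIONS[k] tokens. Returns one chosen count per block within `budget`."""
    B = len(scores)
    prob = pulp.LpProblem("adaptive_token_allocation", pulp.LpMaximize)
    x = {(b, k): pulp.LpVariable(f"x_{b}_{k}", cat="Binary")
         for b in range(B) for k in range(len(TOKEN_OPTIONS))}
    # Objective: total predicted quality.
    prob += pulp.lpSum(scores[b][k] * x[b, k]
                       for b in range(B) for k in range(len(TOKEN_OPTIONS)))
    # Each block picks exactly one token count.
    for b in range(B):
        prob += pulp.lpSum(x[b, k] for k in range(len(TOKEN_OPTIONS))) == 1
    # Overall token budget.
    prob += pulp.lpSum(TOKEN_OPTIONS[k] * x[b, k]
                       for b in range(B) for k in range(len(TOKEN_OPTIONS))) <= budget
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [TOKEN_OPTIONS[k] for b in range(B)
            for k in range(len(TOKEN_OPTIONS)) if x[b, k].value() == 1]

# Two near-static blocks and one dynamic block, 80-token budget: the dynamic
# block (largest quality gain from extra tokens) gets most of the budget.
print(allocate_tokens([[0.90, 0.92, 0.93, 0.94],
                       [0.88, 0.91, 0.92, 0.93],
                       [0.60, 0.75, 0.88, 0.95]], budget=80))
```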
https://arxiv.org/abs/2505.17011
Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
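A compact statement of the Bayesian view described above, with notation assumed for this summary: the meta-trained predictor mixes over pretraining tasks, prompting conditions that mixture, and no prompt can help when the target task lies outside the pretraining support.

```latex
% A meta-trained sequence model approximates the Bayesian predictive distribution
% over the pretraining task mixture p(\tau):
p_\theta(x_{t+1} \mid x_{1:t}) \;\approx\;
\int p(x_{t+1} \mid x_{1:t}, \tau)\, p(\tau \mid x_{1:t})\, d\tau

% Prompting prepends a context z, i.e. it conditions this predictor:
p_\theta(\,\cdot \mid z, x_{1:t}) \;\approx\;
\int p(\,\cdot \mid x_{1:t}, \tau)\, p(\tau \mid z, x_{1:t})\, d\tau

% An optimal prompt for a target task \tau^{*} is one whose induced posterior
% concentrates on \tau^{*}; no such z exists when \tau^{*} is outside the support
% of p(\tau), the limitation that only weight tuning can overcome:
z^{*} \in \arg\min_{z}\; \mathrm{KL}\!\big(p(\,\cdot \mid \tau^{*}) \,\|\, p_\theta(\,\cdot \mid z)\big)
```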
https://arxiv.org/abs/2505.17010
Interpreting the mineralogical aspects of rock thin sections is an important task for oil and gas reservoir evaluation. However, human analysis tends to be subjective and laborious. Technologies like QEMSCAN(R) are designed to automate the mineralogical mapping process, but they also suffer from limitations such as high monetary cost and time-consuming analysis. This work proposes a Convolutional Neural Network model for automatic mineralogical segmentation of thin section images of carbonate rocks. The model is able to mimic the QEMSCAN mapping itself in a low-cost, generalized and efficient manner. For this, the U-Net semantic segmentation architecture is trained on plane- and cross-polarized thin section images using the corresponding QEMSCAN maps as targets, an approach not widely explored. The model was instructed to differentiate occurrences of Calcite, Dolomite, Mg-Clay Minerals, Quartz, Pores and the remaining mineral phases as a single class named "Others", and it was validated on rock facies both seen and unseen during training in order to assess its generalization capability. Since the images and maps are provided at different resolutions, image registration was applied to align them spatially. The study reveals that the quality of the segmentation depends heavily on these resolution differences and on the variety of learnable rock textures. However, it shows promising results, especially regarding the proper delineation of mineral boundaries on solid textures and the precise estimation of mineral distributions, describing a nearly linear relationship between expected and predicted distributions, with a coefficient of determination (R^2) above 0.97 for seen facies and 0.88 for unseen ones.
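The distribution comparison behind the R^2 figures can be reproduced, in spirit, in a few lines: compute per-class area fractions of the predicted and reference (QEMSCAN) maps and compare them. The class list and the simple identity-line R^2 below are assumptions for illustration; the paper may fit a regression instead.

```python
import numpy as np

CLASSES = ["Calcite", "Dolomite", "Mg-Clay", "Quartz", "Pores", "Others"]

def class_fractions(label_map: np.ndarray, n_classes: int = len(CLASSES)) -> np.ndarray:
    """Area fraction of each class in a 2D integer label map (values 0..n_classes-1)."""
    counts = np.bincount(label_map.ravel(), minlength=n_classes)
    return counts / counts.sum()

def r2_between_distributions(pred_map: np.ndarray, ref_map: np.ndarray) -> float:
    """Coefficient of determination between predicted and reference per-mineral
    area fractions (the kind of statistic reported as ~0.97 / ~0.88 above)."""
    y_pred = class_fractions(pred_map)
    y_ref = class_fractions(ref_map)
    ss_res = np.sum((y_ref - y_pred) ** 2)
    ss_tot = np.sum((y_ref - y_ref.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy check with a random map compared to itself (real use would pass the
# U-Net prediction and the registered QEMSCAN map):
rng = np.random.default_rng(0)
ref = rng.integers(0, len(CLASSES), size=(256, 256))
print(round(r2_between_distributions(ref, ref), 3))   # 1.0 for a perfect match
```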
https://arxiv.org/abs/2505.17008
Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. Additionally, we introduce two new metrics for more robustly and affordably evaluating motion and guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
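A minimal sketch of the two proposed metrics as they are named above — a closed-form linear probe from motion embeddings to actions, and the cosine similarity between the two directional motion embeddings of a frame. Array shapes and the least-squares probe are assumptions; CoMo's evaluation protocol may differ in detail.

```python
import numpy as np

def linear_probe_mse(motion_embeddings, actions):
    """Metric (i): fit a linear map from motion embeddings to ground-truth actions
    on a labeled set and report the MSE -- lower means the latent motion carries
    more action-relevant information."""
    Z = np.asarray(motion_embeddings)            # (N, embed_dim)
    A = np.asarray(actions)                      # (N, action_dim)
    Zb = np.hstack([Z, np.ones((len(Z), 1))])    # add a bias column
    W, *_ = np.linalg.lstsq(Zb, A, rcond=None)   # closed-form linear probe
    return float(np.mean((Zb @ W - A) ** 2))

def motion_consistency(past_to_current, future_to_current):
    """Metric (ii): cosine similarity between past->current and future->current
    motion embeddings of the same frame (used as an inexpensive probe of what the
    embeddings encode; the interpretation here is a reading of the abstract)."""
    a, b = np.asarray(past_to_current), np.asarray(future_to_current)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 16))
A = Z[:, :4] + 0.01 * rng.normal(size=(100, 4))
print(linear_probe_mse(Z, A))                         # near the 1e-4 noise floor
print(motion_consistency(rng.normal(size=16), rng.normal(size=16)))
```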
https://arxiv.org/abs/2505.17006
Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods often are costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome-supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at this https URL.
https://arxiv.org/abs/2505.17005
We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at this https URL
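For orientation, the finite-dimensional Tweedie identity that the paper lifts to Hilbert spaces, and a plug-and-play guidance step of the kind described above, look roughly as follows (standard forms with assumed notation; in the function-space version, operator-valued covariances replace matrices and the score is a learned neural-operator denoiser).

```latex
% Tweedie's formula: posterior mean of the clean sample x_0 given a noisy
% state x_t = x_0 + \sigma_t \varepsilon,\ \varepsilon \sim \mathcal{N}(0, I):
\mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^{2}\, \nabla_{x_t} \log p_t(x_t)

% Plug-and-play guidance with sparse observations y = \mathcal{A}(x_0) + \text{noise}:
\hat{x}_0(x_t) = x_t + \sigma_t^{2}\, s_\theta(x_t, t), \qquad
x_t \leftarrow x_t - \zeta_t\, \nabla_{x_t} \big\| y - \mathcal{A}\big(\hat{x}_0(x_t)\big) \big\|^{2}
```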
https://arxiv.org/abs/2505.17004
We study the task of learning association between faces and voices, which has been gaining interest in the multimodal community lately. Existing methods suffer from the deliberate crafting of negative mining procedures as well as the reliance on a distance margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and need to be aligned before being fused. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
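A minimal sketch of the two ingredients named above: projecting both modalities into a shared space and fusing them with a learned gate, plus one possible reading of an orthogonality-style constraint on the fused embeddings. Dimensions, the gating form, and the loss are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFaceVoiceFusion(nn.Module):
    """Project face and voice embeddings into a joint space, then gate-fuse them."""

    def __init__(self, face_dim=512, voice_dim=192, joint_dim=256):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, joint_dim)    # align the two spaces first
        self.voice_proj = nn.Linear(voice_dim, joint_dim)
        self.gate = nn.Linear(2 * joint_dim, joint_dim)

    def forward(self, face, voice):
        f = F.normalize(self.face_proj(face), dim=-1)
        v = F.normalize(self.voice_proj(voice), dim=-1)
        g = torch.sigmoid(self.gate(torch.cat([f, v], dim=-1)))
        return g * f + (1.0 - g) * v                        # gated fusion

def orthogonality_loss(fused, labels):
    """Encourage fused embeddings of different identities to be orthogonal and
    same-identity embeddings to be aligned (one possible constraint; the paper's
    exact loss may differ)."""
    sim = F.normalize(fused, dim=-1) @ F.normalize(fused, dim=-1).T
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    return ((sim - same) ** 2).mean()
```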
https://arxiv.org/abs/2505.17002
This paper studies the task of SatStreet-view synthesis, which aims to render photorealistic street-view panorama images and videos given any satellite image and specified camera positions or trajectories. We formulate the task as learning a neural radiance field from paired images captured from satellite and street viewpoints, which turns out to be a challenging learning problem due to the sparse-view nature of the data and the extremely large viewpoint changes between satellite and street-view images. We tackle these challenges based on a task-specific observation that street-view-specific elements, including the sky and illumination effects, are only visible in street-view panoramas, and present a novel approach, Sat2Density++, which accomplishes photorealistic street-view panorama rendering by modeling these street-view-specific elements in neural networks. In the experiments, our method is evaluated on both urban and suburban scene datasets, demonstrating that Sat2Density++ is capable of rendering photorealistic street-view panoramas that are consistent across multiple views and faithful to the satellite image.
https://arxiv.org/abs/2505.17001
Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) all LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) data in PoT format achieves the best generalization performance across other languages. Additionally, we curate formal-language-related training data to further enhance small language models, and the experimental results indicate that a simple rejection fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at this https URL.
https://arxiv.org/abs/2505.16998
LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
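To make the homogeneous-versus-heterogeneous distinction concrete, the sketch below contrasts the two configurations at the level of per-function backbone assignment. The function names and the "model-A" to "model-D" labels are placeholders, not the paper's actual function taxonomy or model selections.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Agent:
    role: str    # MAS-related function, e.g. reasoning, coding, evaluation
    model: str   # backbone LLM chosen to drive that function

ROLES = ["question-answering", "reasoning", "coding", "evaluation", "aggregation"]

# Homogeneous MAS: every function is driven by the same backbone.
homogeneous: Dict[str, Agent] = {role: Agent(role, "model-A") for role in ROLES}

# Heterogeneous (X-MAS-style) MAS: each function gets the backbone that performed
# best for that domain-function combination on the benchmark (placeholders here).
heterogeneous: Dict[str, Agent] = {
    "question-answering": Agent("question-answering", "model-A"),
    "reasoning":          Agent("reasoning",          "model-B"),
    "coding":             Agent("coding",             "model-C"),
    "evaluation":         Agent("evaluation",         "model-D"),
    "aggregation":        Agent("aggregation",        "model-A"),
}
```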
https://arxiv.org/abs/2505.16997