Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
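A two-part reward of the kind described can be sketched in a few lines. This is a hypothetical reading, not the authors' implementation: the field names (`is_anomaly`, `box`, `defect_type`), the weights, and the tool-use budget are all assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def perception_reward(pred, target, w_cls=1.0, w_box=0.5, w_type=0.5):
    """Classification accuracy + spatial alignment + defect-type correctness."""
    r_cls = float(pred["is_anomaly"] == target["is_anomaly"])
    r_box = iou(pred["box"], target["box"]) if target["is_anomaly"] else 0.0
    r_type = float(pred["defect_type"] == target["defect_type"])
    return w_cls * r_cls + w_box * r_box + w_type * r_type

def behavior_reward(tool_calls, budget=3, penalty=0.1):
    """Penalize tool calls beyond a budget to encourage efficient inspection."""
    return -penalty * max(0, tool_calls - budget)
```

A rollout's total reward would then combine both terms, pushing the agent toward correct, well-localized verdicts reached with few zoom/retrieve calls.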
https://arxiv.org/abs/2512.13671
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. One set serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
https://arxiv.org/abs/2512.13636
Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: this https URL
https://arxiv.org/abs/2512.13635
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
https://arxiv.org/abs/2512.13607
The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via reinforcement learning, which enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at this https URL.
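As background, a (non-trainable) two-step linear multistep update looks like the following; in a ConsistencySolver-style method the combination coefficients (here the classical Adams-Bashforth 3/2 and -1/2) would be the quantities tuned by reinforcement learning. The diffusion ODE itself is replaced by a toy scalar ODE.

```python
def multistep_solve(f, x0, t0, t1, n_steps, coeffs=(1.5, -0.5)):
    """Two-step linear multistep integrator for dx/dt = f(x, t).
    `coeffs` are the per-step combination weights a trainable solver would learn."""
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    f_prev = f(x, t)
    x, t = x + h * f_prev, t + h  # bootstrap the history with one Euler step
    for _ in range(n_steps - 1):
        f_curr = f(x, t)
        # combine current and previous derivative evaluations
        x = x + h * (coeffs[0] * f_curr + coeffs[1] * f_prev)
        f_prev, t = f_curr, t + h
    return x
```

On dx/dt = -x this tracks the exact solution closely even at modest step counts, which is the property a learned high-order solver exploits in the low-step preview regime.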
https://arxiv.org/abs/2512.13592
The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which necessitate that models dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
https://arxiv.org/abs/2512.13573
Memory has emerged as, and will remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
https://arxiv.org/abs/2512.13564
Autonomous free-flyers play a critical role in intravehicular tasks aboard the International Space Station (ISS), where their precise docking under sensing noise, small actuation mismatches, and environmental variability remains a nontrivial challenge. This work presents a reinforcement learning (RL) framework for six-degree-of-freedom (6-DoF) docking of JAXA's Int-Ball2 robot inside a high-fidelity Isaac Sim model of the Japanese Experiment Module (JEM). Using Proximal Policy Optimization (PPO), we train and evaluate controllers under domain-randomized dynamics and bounded observation noise, while explicitly modeling propeller drag-torque effects and polarity structure. This enables a controlled study of how Int-Ball2's propulsion physics influence RL-based docking performance in constrained microgravity interiors. The learned policy achieves stable and reliable docking across varied conditions and lays the groundwork for future extensions pertaining to Int-Ball2 in collision-aware navigation, safe RL, propulsion-accurate sim-to-real transfer, and vision-based end-to-end docking.
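Domain-randomized dynamics and bounded observation noise of the kind described can be sketched as below; the parameter names, ranges, and noise levels are placeholders rather than the paper's values.

```python
import random

def randomize_dynamics(nominal, rng, rel_range=0.1):
    """Scale each physical parameter by a factor drawn from [1-r, 1+r]
    (domain randomization over, e.g., mass and thrust coefficients)."""
    return {k: v * (1.0 + rng.uniform(-rel_range, rel_range))
            for k, v in nominal.items()}

def add_bounded_noise(obs, rng, std=0.01, bound=0.03):
    """Gaussian observation noise, clipped to a hard bound."""
    return [o + max(-bound, min(bound, rng.gauss(0.0, std))) for o in obs]
```

Each training episode would draw fresh dynamics via `randomize_dynamics` and perturb every observation via `add_bounded_noise`, so the learned docking policy cannot overfit one exact simulator instance.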
https://arxiv.org/abs/2512.13514
Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward, which evaluates Node Coverage, Structural Correctness, and Chain Completeness, thereby providing a holistic assessment of reasoning quality. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains, representing a solid advancement in reliable medical AI reasoning. The code and models are available at this https URL.
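Treating a CEG as a set of evidence nodes plus directed edges, the three reward terms could be scored as in this sketch (the set-based scoring and the weights are assumptions; the paper's graph construction and matching are more involved):

```python
def clinical_reasoning_reward(pred_nodes, pred_edges, ceg_nodes, ceg_edges,
                              weights=(0.4, 0.3, 0.3)):
    """Composite reward over a predicted reasoning graph vs. a reference CEG."""
    pn, pe = set(pred_nodes), set(pred_edges)
    gn, ge = set(ceg_nodes), set(ceg_edges)
    node_coverage = len(pn & gn) / len(gn) if gn else 0.0   # evidence recalled
    structural = len(pe & ge) / len(pe) if pe else 0.0      # predicted links valid
    completeness = len(pe & ge) / len(ge) if ge else 0.0    # reference chain recovered
    w1, w2, w3 = weights
    return w1 * node_coverage + w2 * structural + w3 * completeness
```

A perfect reconstruction of the reference graph scores 1.0; partial evidence recall with no valid chain is rewarded only through the coverage term.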
https://arxiv.org/abs/2512.13510
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at this https URL.
https://arxiv.org/abs/2512.13507
The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike prior evolutionary approaches, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
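The bilevel loop can be reduced to a toy bandit: each candidate reward composition gets an inner-loop validation score, and the Meta-Optimizer ascends the resulting gradient. For determinism this sketch collapses the inner loop to fixed scores and uses the exact expected policy gradient instead of sampled REINFORCE updates; it illustrates the meta-gradient idea, not the paper's algorithm.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def meta_step(logits, val_scores, lr=1.0):
    """One exact policy-gradient step on E[score] over candidate rewards.
    d/dlogit_j of sum_i p_i * s_i equals p_j * (s_j - baseline)."""
    probs = softmax(logits)
    baseline = sum(p * s for p, s in zip(probs, val_scores))
    return [l + lr * p * (s - baseline)
            for l, p, s in zip(logits, probs, val_scores)]

def evolve_reward(val_scores, iters=100):
    """Repeatedly ascend the meta-gradient; return the winning composition."""
    logits = [0.0] * len(val_scores)
    for _ in range(iters):
        logits = meta_step(logits, val_scores)
    return max(range(len(logits)), key=lambda j: logits[j])
```

The meta-distribution concentrates on whichever reward composition yields the best inner-loop validation performance, which is the self-improving behavior the paper targets.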
https://arxiv.org/abs/2512.13399
Reinforcement learning (RL) has achieved great success in dexterous grasping, significantly improving grasp performance and generalization from simulation to the real world. However, fine-grained functional grasping, which is essential for downstream manipulation tasks, remains underexplored and faces several challenges: the complexity of specifying goals and reward functions for functional grasps across diverse objects, the difficulty of multi-task RL exploration, and the challenge of sim-to-real transfer. In this work, we propose DemoFunGrasp for universal dexterous functional grasping. We factorize functional grasping conditions into two complementary components - grasping style and affordance - and integrate them into an RL framework that can learn to grasp any object with any functional grasping condition. To address the multi-task optimization challenge, we leverage a single grasping demonstration and reformulate the RL problem as one-step demonstration editing, substantially enhancing sample efficiency and performance. Experimental results in both simulation and the real world show that DemoFunGrasp generalizes to unseen combinations of objects, affordances, and grasping styles, outperforming baselines in both success rate and functional grasping accuracy. Beyond strong sim-to-real capability, our system achieves autonomous instruction-following grasp execution by incorporating a vision-language model (VLM) for planning.
https://arxiv.org/abs/2512.13380
Autonomous Underwater Vehicles (AUVs) require reliable six-degree-of-freedom (6-DOF) position control to operate effectively in complex and dynamic marine environments. Traditional controllers are effective under nominal conditions but exhibit degraded performance when faced with unmodeled dynamics or environmental disturbances. Reinforcement learning (RL) provides a powerful alternative, but training is typically slow and sim-to-real transfer remains challenging. This work introduces a GPU-accelerated RL training pipeline built in JAX and MuJoCo-XLA (MJX). By jointly JIT-compiling large-scale parallel physics simulation and learning updates, we achieve training times of under two days. Through systematic evaluation of multiple RL algorithms, we show robust 6-DOF trajectory tracking and effective disturbance rejection in real underwater experiments, with policies transferred zero-shot from simulation. Our results provide the first explicit real-world demonstration of RL-based AUV position control across all six degrees of freedom.
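The core of such a pipeline is stepping thousands of environment copies as one batched array computation. The NumPy sketch below shows only that pattern, with toy double-integrator dynamics and a hand-written PD controller standing in for MJX physics, JAX's `jit`, and the learned policy.

```python
import numpy as np

def step_batch(pos, vel, actions, dt=0.05):
    """Advance all environments at once (toy double-integrator dynamics)."""
    vel = vel + dt * actions
    pos = pos + dt * vel
    return pos, vel

def rollout(n_envs=1024, horizon=100, seed=0):
    """Batched rollout; returns the largest remaining position error."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(size=(n_envs, 3))
    vel = np.zeros((n_envs, 3))
    for _ in range(horizon):
        actions = -0.5 * pos - 0.8 * vel  # PD stand-in for the trained policy
        pos, vel = step_batch(pos, vel, actions)
    return float(np.abs(pos).max())
```

In the real pipeline the loop body would be jit-compiled together with the learning update, which is where the reported speedup comes from; the batched-array structure shown here is what makes that compilation effective.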
https://arxiv.org/abs/2512.13359
This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is suited to environments with continuous state and action spaces, such as the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.
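For reference, TD3's two signature ingredients, clipped double-Q targets and target-policy smoothing, reduce to a few lines; this scalar sketch omits the networks, replay buffer, and delayed actor updates.

```python
import random

def td3_target(reward, done, target_q1, target_q2, gamma=0.99):
    """Clipped double-Q: bootstrap from the smaller of the two target critics."""
    q = min(target_q1, target_q2)
    return reward + gamma * (0.0 if done else q)

def smoothed_target_action(mu, rng, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Target-policy smoothing: clipped Gaussian noise on the target action."""
    eps = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    return max(-act_limit, min(act_limit, mu + eps))
```

Taking the minimum of two critics curbs the value overestimation that plagues DDPG, and the clipped action noise regularizes the critic targets, both of which matter on a plant as non-linear as the TRAS.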
https://arxiv.org/abs/2512.13356
This paper investigates the application of reinforcement learning (RL) to multi-robot social formation navigation, a critical capability for enabling seamless human-robot coexistence. While RL offers a promising paradigm, the inherent unpredictability and often uncooperative dynamics of pedestrian behavior pose substantial challenges, particularly concerning the efficiency of coordinated exploration among robots. To address this, we propose a novel coordinated-exploration multi-robot RL algorithm that introduces intrinsic-motivation-driven exploration. Its core component is a self-learning intrinsic reward mechanism designed to collectively alleviate policy conservatism. Moreover, this algorithm incorporates a dual-sampling mode within the centralized training and decentralized execution framework to enhance the representation of both the navigation policy and the intrinsic reward, leveraging a two-time-scale update rule to decouple parameter updates. Empirical results on social formation navigation benchmarks demonstrate the proposed algorithm's superior performance over existing state-of-the-art methods across crucial metrics. Our code and video demos are available at: this https URL.
https://arxiv.org/abs/2512.13293
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
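A KL-regularized Plackett-Luce ranking loss can be sketched directly from its definition: the likelihood of a best-first tool ranking factorizes into successive softmax choices over the remaining tools, and a KL term keeps the tuned selection distribution close to a reference. The scores, distributions, and mixing weight `beta` here are illustrative, not AutoTool's parameterization.

```python
import math

def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` (best-first tool indices)
    under a Plackett-Luce model with per-tool scores."""
    nll, remaining = 0.0, list(ranking)
    while remaining:
        m = max(scores[i] for i in remaining)
        logz = m + math.log(sum(math.exp(scores[i] - m) for i in remaining))
        nll += logz - scores[remaining[0]]  # -log softmax over remaining tools
        remaining.pop(0)
    return nll

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ranking_loss(scores, ranking, probs, ref_probs, beta=0.1):
    """PL ranking term plus a KL regularizer toward a reference distribution."""
    return plackett_luce_nll(scores, ranking) + beta * kl_divergence(probs, ref_probs)
```

Rankings that agree with the scores incur lower loss, so gradient steps on this objective sharpen consistent multi-step tool preferences while the KL term limits drift.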
https://arxiv.org/abs/2512.13278
Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Test-time scaling with Warm-K enhances behavioral consistency and reactivity without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
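Two ingredients admit compact sketches: a group-relative advantage (each rollout's reward normalized within its sampled group) and a warm-started Top-K selection that always retains the previously executed candidate. Both are plausible readings of the abstract, not the paper's exact formulas.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and std."""
    m = sum(rewards) / len(rewards)
    std = (sum((r - m) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - m) / (std + eps) for r in rewards]

def warm_topk(scores, prev_index, k):
    """Warm-started Top-K: keep the previously executed candidate for
    consistency, fill the rest with the highest-scoring others for diversity."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pool = [prev_index] + [i for i in order if i != prev_index]
    return pool[:k]
```

Group normalization removes the need for a learned value baseline, and pinning the previous selection in the Top-K pool is one simple way to trade consistency against diversity across replanning steps.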
https://arxiv.org/abs/2512.13262
Direct Preference Optimization (DPO) has emerged as a lightweight and effective alternative to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) for aligning large language and vision-language models. However, the standard DPO formulation, in which both the chosen and rejected responses are generated by the same policy, suffers from a weak learning signal because the two responses often share similar errors and exhibit small Kullback-Leibler (KL) divergence. This leads to slow and unstable convergence. To address this limitation, we introduce Reflective Preference Optimization (RPO), a new framework that incorporates hint-guided reflection into the DPO paradigm. RPO uses external models to identify hallucination sources and generate concise reflective hints, enabling the construction of on-policy preference pairs with stronger contrastiveness and clearer preference signals. We theoretically show that conditioning on hints increases the expected preference margin through mutual information and improves sample efficiency while remaining within the policy distribution family. Empirically, RPO achieves superior alignment with fewer training samples and iterations, substantially reducing hallucination rates and delivering state-of-the-art performance across multimodal benchmarks.
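As background, the standard DPO objective on one (chosen, rejected) pair is a logistic loss on the gap between policy and reference log-ratios; in RPO the pair would additionally be constructed under a reflective hint, which widens this margin. The log-probabilities below are illustrative numbers.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where the margin is the difference of
    (policy - reference) log-ratios for the chosen vs. rejected response."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log(1.0 + math.exp(-margin))  # numerically = -log sigmoid(margin)
```

When the two responses are near-identical under the policy the margin is close to zero and the loss saturates at log 2, which is exactly the weak-signal regime the paper attributes to same-policy pairs; a more contrastive pair yields a larger margin and a stronger gradient.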
https://arxiv.org/abs/2512.13240
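The weak-signal problem RPO targets is visible directly in the standard DPO loss: when chosen and rejected responses share similar errors, the implicit reward margin is small and the loss is nearly flat. A minimal sketch of the pairwise loss (standard DPO, not RPO's full hint-conditioned pipeline; `beta` and the argument names are illustrative):

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss on one preference pair:
    -log sigmoid of the beta-scaled implicit reward margin."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

Hint-guided reflection constructs pairs with larger margins (more contrastive chosen/rejected responses), which yields a stronger gradient signal per pair and hence the sample-efficiency gains the abstract reports.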
Soft Actor-Critic (SAC) is widely used in practical applications and is now one of the most relevant off-policy online model-free reinforcement learning (RL) methods. The technique of n-step returns is known to increase the convergence speed of RL algorithms compared to their 1-step returns-based versions. However, SAC is notoriously difficult to combine with n-step returns, since their usual combination introduces bias in off-policy algorithms due to the changes in action distribution. While this problem is solved by importance sampling, a method for estimating expected values of one distribution using samples from another distribution, importance sampling may result in numerical instability. In this work, we combine SAC with n-step returns in a way that overcomes this issue. We present an approach to applying numerically stable importance sampling with simplified hyperparameter selection. Furthermore, we analyze the entropy estimation approach of Soft Actor-Critic in the context of the n-step maximum entropy framework and formulate the $\tau$-sampled entropy estimation to reduce the variance of the learning target. Finally, we formulate the Soft Actor-Critic with n-step returns (SAC$n$) algorithm that we experimentally verify on MuJoCo simulated environments.
https://arxiv.org/abs/2512.13165
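The interaction between n-step returns and importance sampling can be sketched as follows: each backup step is reweighted by the ratio between the current and behavior policies, and clipping that ratio trades a little bias for numerical stability. This is a generic truncated-IS sketch, not the exact SAC$n$ correction; `clip` and the function signature are assumptions of this illustration.

```python
import math

def n_step_return_is(rewards, bootstrap_value, log_is_ratios, gamma=0.99, clip=1.0):
    """n-step bootstrapped return with truncated importance-sampling
    weights correcting for the off-policy action distribution shift."""
    G = bootstrap_value
    for t in reversed(range(len(rewards))):
        w = min(math.exp(log_is_ratios[t]), clip)  # truncated IS weight
        G = rewards[t] + gamma * w * G
    return G
```

With all log-ratios at zero the weights are 1 and this reduces to the ordinary n-step return; without the clip, a few large ratios can blow up the target, which is the instability the abstract refers to.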
Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, where agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents should dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Prior work under-utilizes the conversational capabilities of language models (LMs), thereby optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users, such as asking the right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset that includes diverse scenarios from task-oriented dialogues, where tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns, even surpassing much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.
https://arxiv.org/abs/2512.13159
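The "balance asking with acting" idea admits a simple shape: reward task success, bonus a clarification only when the instruction was actually ambiguous, and charge a small cost per extra turn so the agent does not clarify indiscriminately. The sketch below is a toy reward under those assumptions; the function name, the bonus/penalty values, and the structure are illustrative, not the paper's reward formulation.

```python
def speak_reward(task_completed, num_clarifications, needed_clarification,
                 turn_penalty=0.1, clarify_bonus=0.5):
    """Toy reward balancing asking with acting: success is rewarded,
    clarifying a genuinely ambiguous request earns a bonus, and every
    clarification turn carries a small cost."""
    r = 1.0 if task_completed else 0.0
    if needed_clarification and num_clarifications > 0:
        r += clarify_bonus                 # useful question
    r -= turn_penalty * num_clarifications  # cost of extra turns
    return r
```

Under this shaping, a single well-placed question on an ambiguous task nets positive reward, while asking when no clarification is needed only pays the turn penalty, which mirrors the abstract's finding of higher task completion without more conversation turns.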