Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. This work introduces normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume-preserving (RealNVP) policies and PAC-style sample-efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.
https://arxiv.org/abs/2602.11142
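The tractable log-likelihoods claimed above come from RealNVP's affine coupling layers, whose Jacobian is triangular, so log|det J| reduces to a simple sum. A minimal NumPy sketch of one coupling layer (the toy `scale_net`/`shift_net` closures and the binary mask are illustrative stand-ins, not the paper's policy networks):

```python
import numpy as np

def coupling_forward(x, mask, scale_net, shift_net):
    """One RealNVP affine coupling layer: the masked half passes through
    unchanged and conditions an affine transform of the other half."""
    x_id = x * mask                       # identity (pass-through) half
    s = scale_net(x_id) * (1 - mask)      # log-scale for transformed half
    t = shift_net(x_id) * (1 - mask)      # shift for transformed half
    y = x_id + (1 - mask) * (x * np.exp(s) + t)
    log_det = s.sum(axis=-1)              # triangular Jacobian: exact, cheap
    return y, log_det

def coupling_inverse(y, mask, scale_net, shift_net):
    """Exact inverse: the identity half is unchanged, so s and t can be
    recomputed from it and the affine transform undone in closed form."""
    y_id = y * mask
    s = scale_net(y_id) * (1 - mask)
    t = shift_net(y_id) * (1 - mask)
    return y_id + (1 - mask) * ((y - t) * np.exp(-s))
```

Because the transform is invertible with a cheap log-determinant, both exact sampling and exact log-likelihoods are available, which is what makes flow policies attractive here.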
Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). However, recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
https://arxiv.org/abs/2602.11096
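The satisficing intervention can be pictured as a monitor loop over reasoning steps that fires only on a threshold violation. The sketch below is an illustration under assumed interfaces (`generate_step`, `safety_score`, and the threshold `tau` are hypothetical stand-ins, not SafeThink's actual components):

```python
PREFIX = "Wait, think safely. "

def guarded_generate(generate_step, safety_score, prompt, max_steps=16, tau=0.5):
    """Satisficing defense: extend the trace step by step and inject a
    corrective prefix only when the safety reward dips below tau.
    Intervenes at most once, matching the 'few steering steps' finding."""
    trace, injected = prompt, False
    for _ in range(max_steps):
        step = generate_step(trace)
        if step is None:          # model signalled end of generation
            break
        # Satisficing check: no maximization, just a threshold test.
        if not injected and safety_score(trace + step) < tau:
            step = PREFIX + step
            injected = True
        trace += step
    return trace, injected
```

The point of the structure is that the safety model is consulted per step, so the (cheap) correction lands early in the trace rather than after a full unsafe completion.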
In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the \emph{data recipe}, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate \emph{end-to-end data recipe generation} for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
https://arxiv.org/abs/2602.11089
Sixth-generation (6G) radio access networks (RANs) must enforce strict service-level agreements (SLAs) for heterogeneous slices, yet sudden latency spikes remain difficult to diagnose and resolve with conventional deep reinforcement learning (DRL) or explainable RL (XRL). We propose \emph{Attention-Enhanced Multi-Agent Proximal Policy Optimization (AE-MAPPO)}, which integrates six specialized attention mechanisms into multi-agent slice control and surfaces them as zero-cost, faithful explanations. The framework operates across O-RAN timescales with a three-phase strategy: predictive, reactive, and inter-slice optimization. A URLLC case study shows AE-MAPPO resolves a latency spike in $18$ms, restores latency to $0.98$ms with $99.9999\%$ reliability, and reduces troubleshooting time by $93\%$ while maintaining eMBB and mMTC continuity. These results confirm AE-MAPPO's ability to combine SLA compliance with inherent interpretability, enabling trustworthy and real-time automation for 6G RAN slicing.
https://arxiv.org/abs/2602.11076
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recently proposed ''thinking with images'' paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment - particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose ''chatting with images'', a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
https://arxiv.org/abs/2602.11073
Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.
https://arxiv.org/abs/2602.02555
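The truncated importance sampling (TIS) correction mentioned above is a standard variance-control device: compute the likelihood ratio between the updated policy and the (perturbed) behavior policy, then cap it at a constant so rare huge ratios cannot dominate the gradient. A minimal sketch (the cap `c` is an illustrative choice, not the paper's setting):

```python
import numpy as np

def tis_weights(logp_new, logp_old, c=2.0):
    """Truncated importance sampling weights.

    Truncation (min with a cap) bounds variance with a one-sided bias,
    unlike PPO-style two-sided clipping; small ratios pass through
    unchanged, only outliers are capped."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return np.minimum(ratio, c)
```

In the PSN setting the "old" log-probabilities would come from the noise-perturbed policy that generated the rollouts, so these weights correct the sampling-update mismatch the abstract describes.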
Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
https://arxiv.org/abs/2602.11000
Board games have long served as complex decision-making benchmarks in artificial intelligence. In this field, search-based reinforcement learning methods such as AlphaZero have achieved remarkable success. However, their significant computational demands have been pointed out as barriers to their reproducibility. In this study, we propose a model-free reinforcement learning algorithm designed for board games to achieve more efficient learning. To validate the efficiency of the proposed method, we conducted comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. The results demonstrate that the proposed method achieves more efficient learning than existing methods across these environments. In addition, our extensive ablation study shows the importance of core techniques used in the proposed method. We believe that our efficient algorithm shows the potential of model-free reinforcement learning in domains traditionally dominated by search-based methods.
https://arxiv.org/abs/2602.10894
Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: this https URL
https://arxiv.org/abs/2602.10880
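A reward that gives "fine-grained, verifiable feedback on structural correctness" can be illustrated as field-level agreement between a predicted and a reference chart specification. The sketch below is a hedged approximation of that idea, with hypothetical field names; it is not the paper's actual Spec-Align Reward:

```python
def spec_align_reward(pred: dict, ref: dict) -> float:
    """Toy structural reward: fraction of reference spec fields the
    prediction reproduces exactly. Partial credit per field makes the
    signal dense enough for RL, unlike a binary render-match check."""
    if not ref:
        return 0.0
    hits = sum(1 for key, value in ref.items() if pred.get(key) == value)
    return hits / len(ref)
```

Because each field is checked independently, the reward is verifiable and decomposable, which is the property the abstract attributes to structure-level supervision.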
Despite the strong performance achieved by reinforcement learning-trained information-seeking agents, learning in open-ended web environments remains severely constrained by low signal-to-noise feedback. Text-based parsers often discard layout semantics and introduce unstructured noise, while long-horizon training typically relies on sparse outcome rewards that obscure which retrieval actions actually matter. We propose a visual-native search framework that represents webpages as visual snapshots, allowing agents to leverage layout cues to quickly localize salient evidence and suppress distractors. To learn effectively from these high-dimensional observations, we introduce Information-Aware Credit Assignment (ICA), a post-hoc method that estimates each retrieved snapshot's contribution to the final outcome via posterior analysis and propagates dense learning signals back to key search turns. Integrated with a GRPO-based training pipeline, our approach consistently outperforms text-based baselines on diverse information-seeking benchmarks, providing evidence that visual snapshot grounding with information-level credit assignment alleviates the credit-assignment bottleneck in open-ended web environments. The code and datasets will be released in this https URL.
https://arxiv.org/abs/2602.10863
The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at this https URL.
https://arxiv.org/abs/2602.10815
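The difficulty-curated filtering idea reduces to keeping samples in a middle band of empirical difficulty, dropping those the base model nearly always or nearly never solves. A minimal sketch (the `pass_rate` field and the (0.2, 0.8) band are illustrative assumptions, not the paper's criterion):

```python
def curate_by_difficulty(samples, low=0.2, high=0.8):
    """Keep medium-difficulty samples only.

    Each sample carries an empirical pass rate in [0, 1], e.g. the base
    model's solve rate over several attempts. Too-easy samples (rate >
    high) add little signal; too-hard ones (rate < low) were observed to
    degrade OOD generalization."""
    return [s for s in samples if low <= s["pass_rate"] <= high]
```

This mirrors the abstract's claim that RL's advantage partly comes from implicitly concentrating updates on medium-difficulty data, here made explicit as a pre-filter before SFT.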
Transfer learning in deep reinforcement learning is often motivated by improved stability and reduced training cost, but it can also fail under substantial domain shift. This paper presents a controlled empirical study examining how architectural differences between Double Deep Q-Networks (DDQN) and Dueling DQN influence transfer behavior across environments. Using CartPole as a source task and LunarLander as a structurally distinct target task, we evaluate a fixed layer-wise representation transfer protocol under identical hyperparameters and training conditions, with baseline agents trained from scratch used to contextualize transfer effects. Empirical results show that DDQN consistently avoids negative transfer under the examined setup and maintains learning dynamics comparable to baseline performance in the target environment. In contrast, Dueling DQN consistently exhibits negative transfer under identical conditions, characterized by degraded rewards and unstable optimization behavior. Statistical analysis across multiple random seeds confirms a significant performance gap under transfer. These findings suggest that architectural inductive bias is strongly associated with robustness to cross-environment transfer in value-based deep reinforcement learning under the examined transfer protocol.
https://arxiv.org/abs/2602.09810
Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, a Value-Guided Efficient Decoding (VED) is developed to identify decisive nodes and selectively deepen high-potential prefixes. This improves exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
https://arxiv.org/abs/2602.10699
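Sibling-relative advantages can be illustrated by standardizing rewards within one branching node's children, in the spirit of GRPO's group normalization: trajectories that share a prefix are compared only against each other, so the signal isolates the local branching decision. A minimal sketch, not the paper's exact estimator:

```python
import math

def sibling_advantages(rewards):
    """Standardize rewards across the children of one branching node.

    Because siblings share everything up to the branch point, mean-
    centering removes the prefix's common reward and the division by the
    sibling std rescales the comparative signal; identical siblings get
    zero advantage (no gradient), countering advantage compression."""
    n = len(rewards)
    mu = sum(rewards) / n
    sd = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    if sd == 0.0:
        return [0.0] * n
    return [(r - mu) / sd for r in rewards]
```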
Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at this https URL
https://arxiv.org/abs/2602.10693
Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FM) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over $134$% on average. We will open-source our code and provide additional videos at this https URL.
https://arxiv.org/abs/2602.10085
Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical ``difficulty bias'' problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose \textbf{OmniVL-Guard}, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. In particular, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios.
https://arxiv.org/abs/2602.10687
Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose the Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
https://arxiv.org/abs/2602.10623
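The Bradley-Terry preference model that BNRM builds on scores a pair via P(chosen ≻ rejected) = σ(r_chosen − r_rejected), and reward models are typically trained on its negative log-likelihood. A minimal sketch of that standard loss, computed in a numerically stable form (this is the base BT objective only, not BNRM's full variational objective):

```python
import math

def bt_nll(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(d) = log(1 + exp(-d)) with d = r_chosen - r_rejected.

    The branch-free stable form avoids overflow for large |d|:
    log(1 + e^{-d}) = log1p(e^{-|d|}) + max(-d, 0)."""
    d = r_chosen - r_rejected
    return math.log1p(math.exp(-abs(d))) + max(-d, 0.0)
```

The loss is log 2 when the model is indifferent and shrinks toward zero as the margin on the chosen response grows, which is exactly the gradient pressure that reward hacking exploits when the margin correlates with length or style.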
Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, independent of future tokens. The resulting filtered IS ratios preserve token-wise, local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
对于大型语言模型的强化学习而言,高方差的令牌级别重要性采样(IS)比率会导致策略优化在大规模时变得不稳定。为了提高稳定性,近期的方法通常会为序列中的所有令牌使用一个固定的序列级别的IS比率或者单独调整每个令牌的IS比率,从而忽略了序列中相邻令牌之间的时序脱策偏差问题。在这篇论文中,我们首先实证地发现局部脱策偏差在令牌级别上具有结构性的不一致性,这可能扭曲了相邻令牌之间的策略梯度更新,并导致训练崩溃。 为了解决这个问题,我们提出了在线因果卡尔曼滤波器(KPO)用于稳定且有效的策略优化。具体来说,我们将所需的IS比率建模为一个随着令牌变化而演化的潜在状态,并应用卡尔曼滤波器根据过去令牌的状态进行在线和自回归更新,而不考虑未来的令牌。经过这样的过滤处理之后的IS比率既能保留每个令牌级别的局部结构感知变动又能有效平滑噪声峰值,从而提供更加稳定且有效的策略更新。 在实验中,KPO方法在具有挑战性的数学推理数据集上取得了优于当前最佳方法的结果。
https://arxiv.org/abs/2602.10609
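The causal filtering idea can be sketched with a scalar Kalman filter over a token sequence: the latent state is the smoothed IS ratio, the transition is a random walk, and each update uses only past observations. The noise parameters `q` and `r` below are illustrative defaults, not values from the paper.

```python
def kalman_filter_ratios(ratios, q=1e-3, r=0.05):
    """Causal scalar Kalman filter over token-level IS ratios.
    Latent state = smoothed ratio; random-walk transition with process
    noise q and observation noise r (both illustrative)."""
    x, p = ratios[0], 1.0          # initial state estimate and variance
    filtered = [x]
    for z in ratios[1:]:
        p = p + q                  # predict: random-walk transition
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # update from the current observation only
        p = (1.0 - k) * p
        filtered.append(x)
    return filtered
```

Note the autoregressive, causal structure: each filtered value depends only on earlier tokens, so a one-token noise spike is damped toward the running estimate instead of propagating a full-magnitude gradient distortion.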
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
我们引入了Step 3.5 Flash,这是一种稀疏的专家混合模型(MoE),它在前沿级别的代理智能和计算效率之间架起了桥梁。我们的重点在于构建代理时最重要的两个方面:敏锐的推理能力和快速、可靠的执行能力。Step 3.5 Flash 结合了一个基础的1960亿参数模型与110亿个活跃参数,以实现高效的推断过程。它通过交错使用3:1滑动窗口/全注意力机制和多令牌预测(MTP-3)进行优化,从而减少多轮代理交互中的延迟和成本。 为了达到前沿级别的智能,我们设计了一个可扩展的强化学习框架,该框架结合了可验证信号与偏好反馈,并且能够在大规模离策略训练下保持稳定,使得在数学、代码和工具使用方面能够持续自我改进。Step 3.5 Flash 在代理任务、编程任务和数学任务中表现出色,在IMO-AnswerBench上得分为85.4%,LiveCodeBench-v6(2024.08-2025.05)得分为86.4%,tau2-Bench得分高达88.2%,在BrowseComp(带上下文管理)任务中得分为69.0%,以及在Terminal-Bench 2.0中的成绩为51.0%。这些结果与前沿模型如GPT-5.2 xHigh和Gemini 3.0 Pro相当。 通过重新定义效率边界,Step 3.5 Flash 为在现实世界工业环境中部署复杂代理提供了一个高密度的基础框架。
https://arxiv.org/abs/2602.10604
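The interleaved 3:1 sliding-window/full attention design can be illustrated with a toy layer schedule and a causal sliding-window mask. The exact layer ordering and window size in Step 3.5 Flash are not given in the abstract, so both are assumptions here.

```python
def layer_attention_schedule(n_layers, ratio=(3, 1)):
    """Interleave sliding-window and full-attention layers in the given
    ratio (3 sliding : 1 full by default); ordering is hypothetical."""
    sw, full = ratio
    pattern = ["sliding"] * sw + ["full"] * full
    return [pattern[i % len(pattern)] for i in range(n_layers)]

def sliding_window_mask(seq_len, window):
    # causal sliding window: token i attends to j iff 0 <= i - j < window
    return [[1 if 0 <= i - j < window else 0 for j in range(seq_len)]
            for i in range(seq_len)]
```

The motivation for the mix is the usual one: sliding-window layers keep per-token attention cost linear in sequence length, while the periodic full-attention layers preserve global information flow, which matters for the long multi-round agentic interactions the model targets.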
Deep reinforcement learning (DRL) may explore infeasible actions during training and execution. Existing approaches assume a symbol grounding function that maps high-dimensional states to consistent symbolic representations and a manually specified action masking technique to constrain actions. In this paper, we propose Neuro-symbolic Action Masking (NSAM), a novel framework that automatically learns symbolic models of high-dimensional states, consistent with given domain constraints, in a minimally supervised manner during the DRL process. Based on the learned symbolic model of states, NSAM learns action masks that rule out infeasible actions. NSAM enables end-to-end integration of symbolic reasoning and deep policy optimization, where improvements in symbolic grounding and policy learning mutually reinforce each other. We evaluate NSAM on multiple domains with constraints, and experimental results demonstrate that NSAM significantly improves the sample efficiency of the DRL agent while substantially reducing constraint violations.
深度强化学习(DRL)在训练和执行过程中可能会探索不可行的动作。现有方法通常假设存在一种符号接地函数,该函数将高维状态映射到一致的符号表示,并且还使用手动指定的动作屏蔽技术来约束动作。本文中,我们提出了一种新颖的框架——神经符号行动掩码(NSAM),它能够以最少监督的方式,在DRL过程中自动学习与给定领域内的高维状态约束相一致的符号模型。基于学到的状态符号模型,NSAM 学习出一种规则,用以排除不可行动作。 NSAM 使得符号推理和深度策略优化之间的端到端集成成为可能,其中符号接地和策略学习方面的改进相互强化。我们在多个具有约束条件的领域中对 NSAM 进行了评估,并通过实验结果证明,NSAM 显著提高了 DRL 代理人的样本效率,并且大幅减少了违反约束的情况发生次数。
https://arxiv.org/abs/2602.10598
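A minimal sketch of symbolic action masking: given a grounded symbolic state and per-action precondition sets, infeasible actions are masked out before greedy action selection. The predicate and action names are hypothetical; in NSAM the symbolic model itself is learned rather than hand-written as it is here.

```python
def feasible_actions(state_symbols, preconditions, actions):
    """Return a boolean mask over actions: an action is feasible iff all
    of its required symbols hold in the grounded symbolic state."""
    return [preconditions.get(a, set()) <= state_symbols for a in actions]

def masked_argmax(q_values, mask):
    # greedy selection restricted to feasible actions
    best, best_q = None, float("-inf")
    for i, (q, ok) in enumerate(zip(q_values, mask)):
        if ok and q > best_q:
            best, best_q = i, q
    return best
```

Masking at selection time is what lets the agent avoid spending samples on constraint-violating actions, which is the mechanism behind the reported sample-efficiency gains.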