Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computational and memory costs can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
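The noise-calibrated Thurstone likelihood can be illustrated as a pairwise probit model whose comparison uncertainty grows with the diffusion timestep. The linear schedule `sigma0 + k * t` below is a placeholder assumption for illustration, not the paper's actual calibration:

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def thurstone_nll(r_win, r_lose, t, sigma0=0.1, k=1.0):
    """Negative log-likelihood of preferring r_win over r_lose under a
    Thurstone (probit) model whose uncertainty grows with the diffusion
    timestep t in [0, 1]. The schedule sigma(t) = sigma0 + k * t is an
    assumed placeholder."""
    sigma_t = sigma0 + k * t
    # The difference of two Gaussians with std sigma_t has std sigma_t * sqrt(2).
    p_win = norm_cdf((r_win - r_lose) / (sigma_t * math.sqrt(2.0)))
    return -math.log(max(p_win, 1e-12))
```

At low noise the same reward gap yields a more confident preference and a smaller loss; at high noise the likelihood flattens, which is what makes the objective noise-aware.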
https://arxiv.org/abs/2602.11146
The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and casts light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
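For context on the hyperparameter under study: weight decay in modern pretraining is typically the decoupled (AdamW-style) form, which shrinks weights directly rather than being folded into the gradient. A minimal single-step sketch, with illustrative values:

```python
def sgd_step_decoupled(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD update with decoupled weight decay: the decay term
    shrinks the weight directly (AdamW-style) instead of entering
    through the gradient."""
    return w - lr * grad - lr * weight_decay * w

w = 1.0
# With zero gradient, a larger decay coefficient pulls weights toward
# zero faster, which is the knob the abstract's experiments vary.
w_small = sgd_step_decoupled(w, 0.0, weight_decay=0.01)
w_large = sgd_step_decoupled(w, 0.0, weight_decay=0.1)
```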
https://arxiv.org/abs/2602.11137
Offline RL algorithms aim to improve upon the behavior policy that produced the collected data while constraining the learned policy to stay within the support of the dataset. However, practical offline datasets often contain examples with little diversity or limited exploration of the environment, drawn from multiple behavior policies with diverse expertise levels. Limited exploration can impair the offline RL algorithm's ability to estimate \textit{Q} or \textit{V} values, while constraining toward diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior policy constraints. We first identify the connection between the $f$-divergence and the optimization constraint on the Bellman residual through a more general Linear Programming (LP) form for RL and the convex conjugate. Following this, we introduce a general, flexible function formulation for the $f$-divergence that incorporates an adaptive constraint on the algorithm's learning objective based on the offline training dataset. Results from experiments on the MuJoCo, Fetch, and AdroitHand environments show the correctness of the proposed LP form and the potential of the flexible $f$-divergence to improve performance when learning from a challenging dataset with a compatible constrained optimization algorithm.
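The convex conjugate $f^*(y) = \sup_t \{ty - f(t)\}$ at the heart of the LP connection can be checked numerically for a concrete generator. The sketch below uses the chi-square generator $f(t) = (t-1)^2$, whose conjugate is $y + y^2/4$; the brute-force grid search is purely illustrative:

```python
def f_chi2(t):
    # Chi-square divergence generator f(t) = (t - 1)^2, for t >= 0.
    return (t - 1.0) ** 2

def conjugate_numeric(f, y, lo=0.0, hi=10.0, steps=20000):
    """Numeric convex conjugate f*(y) = sup_t { t*y - f(t) } over a
    grid. A brute-force sketch for verification only."""
    best = float("-inf")
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        best = max(best, t * y - f(t))
    return best

def conjugate_chi2(y):
    # Closed form for the chi-square generator: f*(y) = y + y^2 / 4
    # (the supremum is attained at t = 1 + y/2).
    return y + y * y / 4.0
```

Agreement between the grid estimate and the closed form is the kind of consistency the LP derivation relies on when swapping a divergence constraint for its conjugate penalty.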
https://arxiv.org/abs/2602.11087
Biometric footstep recognition, based on a person's unique pressure patterns under their feet during walking, is an emerging field with growing applications in security and safety. However, progress in this area has been limited by the lack of large, diverse datasets necessary to address critical challenges such as generalization to new users and robustness to shifts in factors like footwear or walking speed. The recent release of the UNB StepUP-P150 dataset, the largest and most comprehensive collection of high-resolution footstep pressure recordings to date, opens new opportunities for addressing these challenges through deep learning. To mark this milestone, the First International StepUP Competition for Biometric Footstep Recognition was launched. Competitors were tasked with developing robust recognition models using the StepUP-P150 dataset that were then evaluated on a separate, dedicated test set designed to assess verification performance under challenging variations, given limited and relatively homogeneous reference data. The competition attracted global participation, with 23 registered teams from academia and industry. The top-performing team, Saeid_UCC, achieved the best equal error rate (EER) of 10.77% using a generative reward machine (GRM) optimization strategy. Overall, the competition showcased strong solutions, but persistent challenges in generalizing to unfamiliar footwear highlight a critical area for future work.
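The reported metric, equal error rate (EER), is the operating point where false accepts and false rejects balance. A minimal sketch of computing it from verification scores, sweeping thresholds over the observed scores (this sketch assumes higher score means "same person"):

```python
def equal_error_rate(genuine, impostor):
    """Approximate the equal error rate: sweep thresholds over all
    observed scores and return the point where the false accept rate
    (FAR) and false reject rate (FRR) are closest."""
    best_gap, best_eer = float("inf"), 1.0
    for thr in sorted(set(genuine) | set(impostor)):
        far = sum(s >= thr for s in impostor) / len(impostor)
        frr = sum(s < thr for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer
```

Perfectly separated genuine and impostor scores give an EER of 0; the competition's best result of 10.77% reflects score distributions that overlap, e.g. under unfamiliar footwear.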
https://arxiv.org/abs/2602.11086
Sixth-generation (6G) radio access networks (RANs) must enforce strict service-level agreements (SLAs) for heterogeneous slices, yet sudden latency spikes remain difficult to diagnose and resolve with conventional deep reinforcement learning (DRL) or explainable RL (XRL). We propose \emph{Attention-Enhanced Multi-Agent Proximal Policy Optimization (AE-MAPPO)}, which integrates six specialized attention mechanisms into multi-agent slice control and surfaces them as zero-cost, faithful explanations. The framework operates across O-RAN timescales with a three-phase strategy: predictive, reactive, and inter-slice optimization. A URLLC case study shows AE-MAPPO resolves a latency spike in $18$ms, restores latency to $0.98$ms with $99.9999\%$ reliability, and reduces troubleshooting time by $93\%$ while maintaining eMBB and mMTC continuity. These results confirm AE-MAPPO's ability to combine SLA compliance with inherent interpretability, enabling trustworthy and real-time automation for 6G RAN slicing.
https://arxiv.org/abs/2602.11076
We present ROCKET, a training-free model compression method that achieves state-of-the-art performance compared with factorization, structured-sparsification, and dynamic compression baselines. Operating under a global compression budget, ROCKET comprises two key innovations. First, it formulates layer-wise compression allocation as a multi-choice knapsack problem, selecting the optimal compression level for each layer to minimize total reconstruction error while adhering to a target model size. Second, it introduces a single-step sparse matrix factorization inspired by dictionary learning: using only a small calibration set, it sparsifies weight coefficients based on activation-weight sensitivity and then updates the dictionary in closed form via least squares, bypassing iterative optimization, sparse coding, and backpropagation entirely. ROCKET consistently outperforms existing compression approaches across different model architectures at 20-50\% compression rates. Notably, it retains over 90\% of the original model's performance at 30\% compression without any fine-tuning. Moreover, a light fine-tuning phase substantially enhances recovery: for instance, compressing Qwen3-14B to an 8B-parameter model and healing it with just 30 million tokens yields performance nearly on par with the original Qwen3-8B. The code for ROCKET is at this http URL.
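The layer-wise allocation step can be sketched as a standard multi-choice knapsack dynamic program: each layer contributes exactly one (size, error) option, and the DP minimizes total reconstruction error under the size budget. The integer sizes and error values here are illustrative, not ROCKET's actual cost model:

```python
def allocate_compression(layers, budget):
    """Multi-choice knapsack over layers: layers[i] is a list of
    (size, error) options for layer i; pick exactly one option per
    layer, minimizing total error with total size <= budget.
    Sizes are assumed to be small integers so a table DP applies."""
    INF = float("inf")
    # best[s] = minimal total error over processed layers at total size s.
    best = [INF] * (budget + 1)
    best[0] = 0.0
    choice = []
    for options in layers:
        nxt = [INF] * (budget + 1)
        pick = [None] * (budget + 1)
        for s in range(budget + 1):
            if best[s] == INF:
                continue
            for k, (size, err) in enumerate(options):
                if s + size <= budget and best[s] + err < nxt[s + size]:
                    nxt[s + size] = best[s] + err
                    pick[s + size] = k
        best = nxt
        choice.append(pick)
    # Find the best reachable total size, then backtrack the choices.
    s_best = min(range(budget + 1), key=lambda s: best[s])
    if best[s_best] == INF:
        return None, INF
    picks, s = [], s_best
    for i in reversed(range(len(layers))):
        k = choice[i][s]
        picks.append(k)
        s -= layers[i][k][0]
    return list(reversed(picks)), best[s_best]
```

For example, with two layers offering options [(1, 5.0), (2, 1.0)] and [(1, 4.0), (3, 0.5)] and a budget of 4, the DP picks the aggressive level for layer 0 and the mild level for layer 1, since the lowest-error combination would exceed the budget.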
https://arxiv.org/abs/2602.11008
Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
https://arxiv.org/abs/2602.11000
Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose \textbf{FastUSP}, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves consistent \textbf{1.12$\times$--1.16$\times$} end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves \textbf{1.09$\times$} speedup on 2 GPUs; on 4--8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to 1.30$\times$--1.46$\times$ of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead -- rather than communication latency -- is the primary bottleneck on modern high-bandwidth GPU interconnects.
https://arxiv.org/abs/2602.10940
Multi-task policy search is a challenging problem because policies are required to generalize beyond training cases. Curriculum learning has proven effective in this setting, as it introduces complexity progressively. However, designing effective curricula is labor-intensive and requires extensive domain expertise. LLM-based curriculum generation has only recently emerged as a potential solution, but has been limited to static, offline modes that do not leverage real-time feedback from the optimizer. Here we propose an interactive LLM-assisted framework for online curriculum generation, where the LLM adaptively designs training cases based on real-time feedback from the evolutionary optimization process. We investigate how different feedback modalities, ranging from numeric metrics alone to combinations with plots and behavior visualizations, influence the LLM's ability to generate meaningful curricula. Through a 2D robot navigation case study, tackled with genetic programming as the optimizer, we evaluate our approach against static LLM-generated curricula and expert-designed baselines. We show that interactive curriculum generation outperforms static approaches, with multimodal feedback incorporating both progression plots and behavior visualizations yielding performance competitive with expert-designed curricula. This work contributes to understanding how LLMs can serve as interactive curriculum designers for embodied AI systems, with potential extensions to broader evolutionary robotics applications.
https://arxiv.org/abs/2602.10891
Recent advances in Neural Combinatorial Optimization (NCO) have been dominated by diffusion models that treat the Euclidean Traveling Salesman Problem (TSP) as a stochastic $N \times N$ heatmap generation task. In this paper, we propose CycFlow, a framework that replaces iterative edge denoising with deterministic point transport. CycFlow learns an instance-conditioned vector field that continuously transports input 2D coordinates to a canonical circular arrangement, where the optimal tour is recovered from this $2N$ dimensional representation via angular sorting. By leveraging data-dependent flow matching, we bypass the quadratic bottleneck of edge scoring in favor of linear coordinate dynamics. This paradigm shift accelerates solving speed by up to three orders of magnitude compared to state-of-the-art diffusion baselines, while maintaining competitive optimality gaps.
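The final readout step, recovering a tour by angular sorting, can be sketched directly. The learned instance-conditioned transport is omitted, so this assumes the points already lie near a circular arrangement:

```python
import math

def tour_from_circle(points):
    """Recover a tour by angular sorting: given 2D points that (after
    transport) lie near a circle, order them by angle around their
    centroid. This stands in for the readout step only; the learned
    vector field that produces the circular arrangement is omitted."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return sorted(range(len(points)),
                  key=lambda i: math.atan2(points[i][1] - cy,
                                           points[i][0] - cx))
```

Sorting by angle is O(N log N), which is the source of the linear-in-coordinates advantage over scoring all $N \times N$ edges.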
https://arxiv.org/abs/2602.10794
Software vulnerability detection (SVD) is a critical challenge in modern systems. Large language models (LLMs) offer natural-language explanations alongside predictions, but most work focuses on binary evaluation, and explanations often lack semantic consistency with Common Weakness Enumeration (CWE) categories. We propose VulReaD, a knowledge-graph-guided approach for vulnerability reasoning and detection that moves beyond binary classification toward CWE-level reasoning. VulReaD leverages a security knowledge graph (KG) as a semantic backbone and uses a strong teacher LLM to generate CWE-consistent contrastive reasoning supervision, enabling student model training without manual annotations. Students are fine-tuned with Odds Ratio Preference Optimization (ORPO) to encourage taxonomy-aligned reasoning while suppressing unsupported explanations. Across three real-world datasets, VulReaD improves binary F1 by 8-10% and improves multi-class classification by 30% in Macro-F1 and 18% in Micro-F1 compared to state-of-the-art baselines. Results show that LLMs outperform deep learning baselines in binary detection and that KG-guided reasoning enhances CWE coverage and interpretability.
https://arxiv.org/abs/2602.10787
Slow iterative sampling remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first analyze these two limitations: training instability originates from loss divergence induced by an unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on these insights, we propose the Dual-End Consistency Model (DE-CM), which selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CM objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.
https://arxiv.org/abs/2602.10764
Transfer learning in deep reinforcement learning is often motivated by improved stability and reduced training cost, but it can also fail under substantial domain shift. This paper presents a controlled empirical study examining how architectural differences between Double Deep Q-Networks (DDQN) and Dueling DQN influence transfer behavior across environments. Using CartPole as a source task and LunarLander as a structurally distinct target task, we evaluate a fixed layer-wise representation transfer protocol under identical hyperparameters and training conditions, with baseline agents trained from scratch used to contextualize transfer effects. Empirical results show that DDQN consistently avoids negative transfer under the examined setup and maintains learning dynamics comparable to baseline performance in the target environment. In contrast, Dueling DQN consistently exhibits negative transfer under identical conditions, characterized by degraded rewards and unstable optimization behavior. Statistical analysis across multiple random seeds confirms a significant performance gap under transfer. These findings suggest that architectural inductive bias is strongly associated with robustness to cross-environment transfer in value-based deep reinforcement learning under the examined transfer protocol.
https://arxiv.org/abs/2602.09810
The reconstruction of X-ray CT images from sparse or limited-angle geometries is a highly challenging task. The lack of data typically results in artifacts in the reconstructed image and may even lead to object distortions. For this reason, the use of deep generative models in this context is of great interest and has strong potential. In the Deep Generative Prior (DGP) framework, diffusion-based generative models are combined with an iterative optimization algorithm to reconstruct CT images from sinograms acquired under sparse geometries, maintaining the explainability of a model-based approach while introducing the generative power of a neural network. Several aspects of these frameworks can therefore be further investigated to improve reconstruction quality, such as the image generation, the model, and the iterative algorithm used to solve the minimization problem, for which we propose modifications with respect to existing approaches. The results obtained even under highly sparse geometries are very promising, although further research is clearly needed in this direction.
https://arxiv.org/abs/2602.10722
Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, a Value-Guided Efficient Decoding (VED) is developed to identify decisive nodes and selectively deepen high-potential prefixes. This improves exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
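The sibling-relative advantage idea can be sketched as group-wise reward normalization, where a group is the set of trajectories sharing a parent node in the decoding tree. The exact grouping and normalization used by Sibling-GRPO may differ; this is a structural sketch:

```python
def sibling_relative_advantages(rewards, parents):
    """Compute advantages by normalizing each trajectory's reward
    against its siblings (trajectories sharing the same parent node in
    the decoding tree) rather than against the whole batch. Groups
    with identical rewards yield zero advantage, making the decisive
    branching decisions carry the learning signal."""
    groups = {}
    for i, p in enumerate(parents):
        groups.setdefault(p, []).append(i)
    adv = [0.0] * len(rewards)
    for members in groups.values():
        mean = sum(rewards[i] for i in members) / len(members)
        var = sum((rewards[i] - mean) ** 2 for i in members) / len(members)
        std = var ** 0.5
        for i in members:
            adv[i] = (rewards[i] - mean) / (std + 1e-8)
    return adv
```

Normalizing within sibling groups directly counters the advantage-compression failure: trajectories that share a high-probability prefix but diverge at a decisive node are compared against each other, not against unrelated trajectories.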
https://arxiv.org/abs/2602.10699
Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at this https URL
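A sequence-level importance weight sums per-token log-prob differences with no length normalization. VESPO's closed-form reshaping kernel is not reproduced here; the tanh-based soft clip below is only an illustrative stand-in for how a reshaping in log space can tame extreme ratios:

```python
import math

def sequence_importance_weight(logp_new, logp_old):
    """Sequence-level importance ratio: exponentiate the *summed*
    per-token log-prob difference (no length normalization)."""
    return math.exp(sum(logp_new) - sum(logp_old))

def reshape_weight(w, tau=2.0):
    """Illustrative smooth reshaping of an importance weight in log
    space: log w is squashed by tanh so extreme ratios saturate near
    exp(tau) instead of exploding. This is an assumed stand-in, not
    VESPO's derived kernel."""
    return math.exp(tau * math.tanh(math.log(w) / tau))
```

When the behavior and current policies agree, the weight is 1 and the reshaping is a no-op; under severe staleness, hard token-level clipping would bias the estimate while a smooth sequence-level reshaping merely bounds its variance.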
https://arxiv.org/abs/2602.10693
Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper develops a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical ``difficulty bias'' problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose \textbf{OmniVL-Guard}, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. In particular, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios.
https://arxiv.org/abs/2602.10687
To develop socially intelligent AI, existing approaches typically model human behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, task-specific modeling often increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances learning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release Omnisapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, Omnisapiens-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% in multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.
https://arxiv.org/abs/2602.10635
Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose the Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust, uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
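A toy version of the core combination makes the abstract concrete: a reward composed from non-negative latent factor scores, plugged into a Bradley-Terry pairwise likelihood. The tiny numpy model below is illustrative only; BNRM itself uses sparse priors and amortized variational inference over deep LLM representations, none of which appears here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3   # feature dimension, number of latent factors

def reward(features, W, z):
    """Reward as a non-negative mixture of factor scores.
    W: (k, d) factor loadings; z: (k,) non-negative factor weights."""
    return float(z @ np.maximum(W @ features, 0.0))  # ReLU keeps factors >= 0

def bt_log_likelihood(f_win, f_lose, W, z):
    """Bradley-Terry log-likelihood that the 'win' response is preferred:
    log sigmoid(r_win - r_lose)."""
    margin = reward(f_win, W, z) - reward(f_lose, W, z)
    return -np.log1p(np.exp(-margin))

W = rng.normal(size=(k, d))
z = np.abs(rng.normal(size=k))          # non-negative global factor weights
f_win, f_lose = rng.normal(size=d), rng.normal(size=d)
print(bt_log_likelihood(f_win, f_lose, W, z))
```

Because each factor contributes non-negatively, a single spurious factor (e.g., one tracking response length) can be suppressed by shrinking its weight toward zero without flipping the sign of other factors' contributions.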
https://arxiv.org/abs/2602.10623
Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address this issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, without using future tokens. The resulting filtered IS ratios preserve token-wise, structure-aware local variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
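The causal filtering idea can be sketched with a scalar Kalman filter: treat each observed token-level IS ratio as a noisy measurement of a latent state that follows a random walk across tokens, and update the state using only past tokens. The variances `q` and `r` below are illustrative assumptions, not values from the paper.

```python
def kalman_filter_is_ratios(is_ratios, q=0.01, r=0.25):
    """Causally smooth a sequence of token-level IS ratios.
    q: process-noise variance (random-walk state transition),
    r: observation-noise variance."""
    x = is_ratios[0]   # initial state estimate
    p = 1.0            # initial state variance
    filtered = []
    for z in is_ratios:
        p = p + q                # predict: random walk adds process noise
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)      # update: blend prediction with observation
        p = (1.0 - k) * p
        filtered.append(x)
    return filtered

ratios = [1.0, 1.1, 0.9, 5.0, 1.0, 0.95]   # off-policy spike at position 3
smooth = kalman_filter_is_ratios(ratios)
print([round(v, 3) for v in smooth])
```

The spike at position 3 is strongly attenuated (pulled toward the running state estimate) while small token-to-token variation survives, which is the qualitative behavior the abstract ascribes to the filtered ratios.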
https://arxiv.org/abs/2602.10609