Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce \textbf{LoCoMo-Plus}, a benchmark for assessing cognitive memory under cue--trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveal failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: this https URL.
https://arxiv.org/abs/2602.10715
Asset retrieval--finding similar assets in a financial universe--is central to quantitative investment decision-making. Existing approaches define similarity through historical price patterns or sector classifications, but such backward-looking criteria provide no guarantee about future behavior. We argue that effective asset retrieval should be future-aligned: the retrieved assets should be those most likely to exhibit correlated future returns. To this end, we propose Future-Aligned Soft Contrastive Learning (FASCL), a representation learning framework whose soft contrastive loss uses pairwise future return correlations as continuous supervision targets. We further introduce an evaluation protocol designed to directly assess whether retrieved assets share similar future trajectories. Experiments on 4,229 US equities demonstrate that FASCL consistently outperforms 13 baselines across all future-behavior metrics. The source code will be available soon.
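The training signal described above — a soft contrastive loss whose targets are pairwise future return correlations — can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the temperature, the rescaling of correlations from [-1, 1] to [0, 1], and the row-wise cross-entropy form are all assumptions.

```python
import numpy as np

def soft_contrastive_loss(embeddings, future_returns, temperature=0.1):
    """Soft contrastive loss with future-return correlations as targets.

    Sketch of the FASCL idea: instead of hard positive/negative pairs,
    each pair (i, j) gets a continuous target derived from the Pearson
    correlation of the two assets' future return series.
    """
    # Pairwise cosine similarity between learned asset embeddings.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    # Continuous supervision: pairwise correlations of future returns,
    # rescaled from [-1, 1] to [0, 1] to act as soft alignment targets.
    corr = np.corrcoef(future_returns)
    target = (corr + 1.0) / 2.0

    # Row-wise softmax over similarities vs. normalized targets.
    p = np.exp(sim - sim.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    q = target / target.sum(axis=1, keepdims=True)
    return float(-(q * np.log(p + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))    # 8 assets, 16-dim embeddings (hypothetical)
rets = rng.normal(size=(8, 60))   # 60 days of future returns per asset
loss = soft_contrastive_loss(emb, rets)
```

In a real training loop the embeddings would come from the representation network and the loss would be backpropagated; here the point is only the shape of the supervision.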
https://arxiv.org/abs/2602.10711
The task of graph-level anomaly detection (GLAD) is to identify anomalous graphs that deviate significantly from the majority of graphs in a dataset. While deep GLAD methods have shown promising performance, their black-box nature limits their reliability and deployment in real-world applications. Although some recent methods have made attempts to provide explanations for anomaly detection results, they either provide explanations without referencing normal graphs, or rely on abstract latent vectors as prototypes rather than concrete graphs from the dataset. To address these limitations, we propose Prototype-based Graph-Level Anomaly Detection (ProtoGLAD), an interpretable unsupervised framework that provides an explanation for each detected anomaly by explicitly contrasting with its nearest normal prototype graph. It employs a point-set kernel to iteratively discover multiple normal prototype graphs and their associated clusters from the dataset, and then identifies graphs distant from all discovered normal clusters as anomalies. Extensive experiments on multiple real-world datasets demonstrate that ProtoGLAD achieves competitive anomaly detection performance compared to state-of-the-art GLAD methods while providing better human-interpretable prototype-based explanations.
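The scoring-and-explanation step can be shown in isolation. The prototype discovery itself (via the point-set kernel) is the paper's contribution and is not reproduced here; this hypothetical sketch assumes the normal prototypes are already found as graph feature vectors and shows how distance to the nearest prototype yields both an anomaly score and a contrastive explanation.

```python
import numpy as np

def anomaly_scores(X, prototypes):
    """Distance-to-nearest-prototype anomaly scoring (ProtoGLAD-style sketch).

    A graph far from every normal prototype is flagged as anomalous, and
    its nearest prototype serves as the contrastive explanation.
    """
    # Pairwise distances from each graph embedding to each prototype.
    D = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=-1)
    nearest = D.argmin(axis=1)   # index of the explaining prototype
    return D.min(axis=1), nearest

protos = np.array([[0.0, 0.0], [10.0, 10.0]])            # two normal prototypes
X = np.array([[0.2, -0.1], [9.8, 10.1], [50.0, 50.0]])   # last graph is anomalous
scores, nearest = anomaly_scores(X, protos)
```

The third point, far from both prototypes, receives the largest score, and `nearest` names the prototype against which each graph would be explained.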
https://arxiv.org/abs/2602.10708
Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, a Value-Guided Efficient Decoding (VED) is developed to identify decisive nodes and selectively deepen high-potential prefixes. This improves exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
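Sibling-GRPO's sibling-relative advantage can be sketched independently of the tree search. In this hedged NumPy illustration, `parent_ids` marks which tree node each sampled trajectory branched from (a stand-in for the induced tree topology); the exact normalisation used in the paper may differ.

```python
import numpy as np

def sibling_relative_advantages(rewards, parent_ids):
    """Sibling-relative advantages, a sketch of the Sibling-GRPO idea.

    Trajectories are grouped by the node they branch from; each child's
    advantage is its reward minus the mean reward of its siblings, which
    concentrates the learning signal on the branching decision itself.
    """
    r = np.asarray(rewards, dtype=float)
    adv = np.empty_like(r)
    for p in np.unique(parent_ids):
        m = parent_ids == p
        adv[m] = r[m] - r[m].mean()   # centre within the sibling group
    return adv

# Two siblings under node 0 and three under node 1 (hypothetical rewards).
rewards = np.array([1.0, 0.0, 0.5, 0.5, 2.0])
parents = np.array([0, 0, 1, 1, 1])
adv = sibling_relative_advantages(rewards, parents)
```

Because advantages are centred within each sibling group rather than over the whole batch, trajectories that share a high-probability prefix no longer wash out the comparative signal.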
https://arxiv.org/abs/2602.10699
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on VLMs trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also leads to superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.
https://arxiv.org/abs/2602.10698
Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at this https URL
https://arxiv.org/abs/2602.10693
Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical ``difficulty bias'' problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose \textbf{OmniVL-Guard}, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. Particularly, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios.
https://arxiv.org/abs/2602.10687
Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from $2.7$ million video clips, explicitly designed for dynamic visual question answering. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of $1,078$ samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages pre-trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues, iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, which fully validates its effectiveness for visual question answering in dynamic scenarios. Our code and data are available at this https URL.
https://arxiv.org/abs/2602.10675
To develop socially intelligent AI, existing approaches typically model human behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, task-specific modeling often increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across different heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances learning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release Omnisapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, Omnisapiens-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% on multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.
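The advantage-modulation idea — no single task dominating the policy gradient — can be approximated by per-task standardisation. This is a simplified sketch of the balancing step, not HARPO's actual modulation rule.

```python
import numpy as np

def balance_advantages(advantages, task_ids):
    """Per-task advantage rebalancing, a simplified HARPO-style sketch.

    Advantages are standardised within each task group so that tasks with
    larger reward scales do not dominate the policy gradient.
    """
    adv = np.asarray(advantages, dtype=float)
    out = np.empty_like(adv)
    for t in np.unique(task_ids):
        m = task_ids == t
        g = adv[m]
        out[m] = (g - g.mean()) / (g.std() + 1e-8)  # standardise per group
    return out

adv = np.array([10.0, 30.0, 0.1, 0.3])   # task 0 has a 100x larger reward scale
tasks = np.array([0, 0, 1, 1])
balanced = balance_advantages(adv, tasks)
```

After balancing, both task groups contribute advantages on the same scale, so a heterogeneous batch no longer over-weights the high-variance task.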
https://arxiv.org/abs/2602.10635
This white paper presents a critical synthesis of the recent breakthrough in nonuniformly elliptic regularity theory and the burgeoning field of neurosymbolic large reasoning models (LRMs). We explore the resolution of the long-standing sharp growth rate conjecture in Schauder theory, achieved by Cristiana De Filippis and Giuseppe Mingione, which identifies the exact threshold $q/p < 1 + \alpha/n$ for gradient Hölder continuity. Central to this mathematical achievement is the ``ghost equation'' methodology, a sophisticated auxiliary derivation that bypasses the non-differentiability of classical Euler-Lagrange systems. We propose that the next era of mathematical discovery lies in the integration of these pure analytical constructs with LRMs grounded in topos theory and formal verification frameworks such as Safe and Typed Chain-of-Thought (PC-CoT). By modeling the reasoning process as a categorical colimit in a slice topos, we demonstrate how LRMs can autonomously navigate the ``Dark Side'' of the calculus of variations, providing machine-checkable proofs for regularity bounds in complex, multi-phase physical systems.
https://arxiv.org/abs/2602.10632
Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, which is essential for natural social interaction. Although recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding, it is still underexplored whether this benefit transfers to socio-cognitive skills. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models on three representative ToM benchmarks. The results show that reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. A fine-grained analysis reveals three insights. First, slow thinking collapses: accuracy significantly drops as responses grow longer, and larger reasoning budgets hurt performance. Second, moderate and adaptive reasoning benefits performance: constraining reasoning length mitigates failure, while distinct success patterns demonstrate the necessity of dynamic adaptation. Third, option matching shortcut: when multiple choice options are removed, reasoning models improve markedly, indicating reliance on option matching rather than genuine deduction. We also design two intervention approaches: Slow-to-Fast (S2F) adaptive reasoning and Think-to-Match (T2M) shortcut prevention to further verify and mitigate the problems. With all results, our study highlights the advancement of LRMs in formal reasoning (e.g., math, code) cannot be fully transferred to ToM, a typical task in social reasoning. We conclude that achieving robust ToM requires developing unique capabilities beyond existing reasoning methods.
https://arxiv.org/abs/2602.10625
Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders disentangle clinically meaningful concepts without supervision, outperforming predefined-vocabulary approaches and enabling targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.
https://arxiv.org/abs/2602.10624
Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
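BNRM layers its sparse non-negative latent-factor structure on top of the standard Bradley-Terry objective, which is simple to state in code. The latent-factor machinery is omitted here; this sketch shows only the BT negative log-likelihood that a preference-based reward model minimises.

```python
import numpy as np

def bradley_terry_nll(r_chosen, r_rejected):
    """Negative log-likelihood of the Bradley-Terry preference model.

    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected); the reward
    model is trained to maximise the likelihood of annotated preferences.
    """
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log sigmoid(margin), computed stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))

nll_confident = bradley_terry_nll([2.0], [0.0])    # clear preference margin
nll_uninformed = bradley_terry_nll([0.0], [0.0])   # equal rewards -> log 2
```

A larger reward margin for the preferred response lowers the loss, which is the gradient signal BNRM then regularises through its disentanglement-then-debiasing structure.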
https://arxiv.org/abs/2602.10623
Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, regardless of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
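The filtering step can be illustrated with a scalar random-walk Kalman filter over a token-level IS-ratio sequence. The noise variances below are illustrative assumptions; the abstract does not specify KPO's state-space parameterisation.

```python
import numpy as np

def kalman_filter_ratios(ratios, process_var=0.01, obs_var=0.25):
    """Causal scalar Kalman filter over token-level IS ratios (KPO-style sketch).

    The latent 'desired' ratio is modelled as a random walk across token
    positions; each raw token-level ratio is a noisy observation of it.
    The filter uses only past tokens, matching an online, autoregressive
    update that never looks at future tokens.
    """
    x, p = ratios[0], obs_var        # initial state estimate and variance
    filtered = [x]
    for z in ratios[1:]:
        p = p + process_var          # predict: state diffuses between tokens
        k = p / (p + obs_var)        # Kalman gain
        x = x + k * (z - x)          # correct with the new noisy ratio
        p = (1.0 - k) * p
        filtered.append(x)
    return np.array(filtered)

# A spiky ratio sequence: filtering suppresses the outlier at position 3.
raw = np.array([1.0, 1.1, 0.9, 4.0, 1.0, 0.95])
smooth = kalman_filter_ratios(raw)
```

The spike is pulled sharply toward the running estimate while the smooth local variation of neighbouring ratios is preserved, which is the stabilising behaviour the abstract describes.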
https://arxiv.org/abs/2602.10609
Zeroth-order (ZO) optimization has long been favored for its biological plausibility and its capacity to handle non-differentiable objectives, yet its computational complexity has historically limited its application in deep neural networks. Challenging the conventional paradigm that gradients propagate layer-by-layer, we propose Hierarchical Zeroth-Order (HZO) optimization, a novel divide-and-conquer strategy that decomposes the depth dimension of the network. We prove that HZO reduces the query complexity from $O(ML^2)$ to $O(ML \log L)$ for a network of width $M$ and depth $L$, representing a significant leap over existing ZO methodologies. Furthermore, we provide a detailed error analysis showing that HZO maintains numerical stability by operating near the unitary limit ($L_{lip} \approx 1$). Extensive evaluations on CIFAR-10 and ImageNet demonstrate that HZO achieves competitive accuracy compared to backpropagation.
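As background for HZO (this is the standard building block, not the hierarchical decomposition itself), the classic two-point Gaussian-smoothing zeroth-order gradient estimator queries only function values:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, n_queries=2000, rng=None):
    """Two-point Gaussian-smoothing zeroth-order gradient estimate.

    g ~= E_{u ~ N(0, I)} [ (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u ],
    requiring no derivatives of f -- the property that makes ZO methods
    suit non-differentiable objectives.
    """
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n_queries):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / n_queries

# Sanity check on f(x) = ||x||^2, whose true gradient is 2x.
x = np.array([1.0, -2.0, 0.5])
g = zo_gradient(lambda v: float(v @ v), x)
```

The query cost of estimators like this is exactly what HZO's divide-and-conquer over network depth attacks, reducing $O(ML^2)$ to $O(ML \log L)$.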
https://arxiv.org/abs/2602.10607
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
https://arxiv.org/abs/2602.10604
Deep reinforcement learning (DRL) may explore infeasible actions during training and execution. Existing approaches assume a symbol grounding function that maps high-dimensional states to consistent symbolic representations, together with manually specified action-masking techniques to constrain actions. In this paper, we propose Neuro-symbolic Action Masking (NSAM), a novel framework that automatically learns symbolic models consistent with given domain constraints over high-dimensional states, in a minimally supervised manner, during the DRL process. Based on the learned symbolic model of states, NSAM learns action masks that rule out infeasible actions. NSAM enables end-to-end integration of symbolic reasoning and deep policy optimization, where improvements in symbolic grounding and policy learning mutually reinforce each other. We evaluate NSAM on multiple domains with constraints, and experimental results demonstrate that NSAM significantly improves the sample efficiency of the DRL agent while substantially reducing constraint violations.
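The masking mechanism itself can be shown in isolation. This sketch assumes a Boolean feasibility vector is already available (in NSAM it would come from the learned symbolic model) and only demonstrates how infeasible actions are zeroed out before sampling.

```python
import numpy as np

def masked_policy(logits, feasible):
    """Turn policy logits into action probabilities under a feasibility mask.

    Infeasible actions get logit -inf, hence probability exactly zero,
    so the agent can never sample a constraint-violating action.
    """
    masked = np.where(feasible, logits, -np.inf)
    p = np.exp(masked - masked.max())   # numerically stable softmax
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
feasible = np.array([True, False, True, True])  # action 1 violates a constraint
probs = masked_policy(logits, feasible)
```

The remaining probability mass is renormalised over feasible actions only, which is what lets masking reduce violations without distorting the relative preferences among allowed actions.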
https://arxiv.org/abs/2602.10598
The trade-off between interpretability and accuracy remains a core challenge in machine learning. Standard Generalized Additive Models (GAMs) offer clear feature attributions but are often constrained by their strictly additive nature, which can limit predictive performance. Introducing feature interactions can boost accuracy yet may obscure individual feature contributions. To address these issues, we propose Neural Additive Experts (NAEs), a novel framework that seamlessly balances interpretability and accuracy. NAEs employ a mixture of experts framework, learning multiple specialized networks per feature, while a dynamic gating mechanism integrates information across features, thereby relaxing rigid additive constraints. Furthermore, we propose targeted regularization techniques to mitigate variance among expert predictions, facilitating a smooth transition from an exclusively additive model to one that captures intricate feature interactions while maintaining clarity in feature attributions. Our theoretical analysis and experiments on synthetic data illustrate the model's flexibility, and extensive evaluations on real-world datasets confirm that NAEs achieve an optimal balance between predictive accuracy and transparent, feature-level explanations. The code is available at this https URL.
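A minimal forward pass conveys the structure: per-feature experts plus an input-conditioned gate that relaxes strict additivity. All shapes and the affine experts below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def nae_forward(x, experts, gate_w):
    """Minimal forward pass in the spirit of Neural Additive Experts.

    experts: (n_features, K, 2) slope/bias pairs -- K affine experts per
             feature (a stand-in for small per-feature networks).
    gate_w:  (n_features, K, d) -- the gate for feature j sees the WHOLE
             input x, which is what relaxes the strict additive constraint.
    """
    n_features, K, _ = experts.shape
    out = 0.0
    for j in range(n_features):
        f = experts[j, :, 0] * x[j] + experts[j, :, 1]   # K expert outputs
        g = gate_w[j] @ x                                # gate logits, (K,)
        g = np.exp(g - g.max())
        g /= g.sum()                                     # softmax gate
        out += float(g @ f)                              # gated mixture
    return out

rng = np.random.default_rng(0)
d, K = 3, 4
x = rng.normal(size=d)
y = nae_forward(x, rng.normal(size=(d, K, 2)), rng.normal(size=(d, K, d)))
```

With a single expert per feature the gate is trivial and the model collapses to a plain additive GAM, which mirrors the smooth additive-to-interacting transition the paper describes.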
https://arxiv.org/abs/2602.10585
Standard autoregressive language models generate text token-by-token from a fixed vocabulary, inducing a tree-structured state space when viewing token sampling as an action, which limits flexibility and expressiveness. Recent work introduces dynamic vocabulary by sampling retrieved text spans but overlooks that the same sentence can be composed of spans of varying lengths, lacking explicit modeling of the directed acyclic graph (DAG) state space. This leads to restricted exploration of compositional paths and is biased toward the chosen path. Generative Flow Networks (GFlowNets) are powerful for efficient exploring and generalizing over state spaces, particularly those with a DAG structure. However, prior GFlowNets-based language models operate at the token level and remain confined to tree-structured spaces, limiting their potential. In this work, we propose Flow of SpanS (FoSS), a principled GFlowNets framework for span generation. FoSS constructs a dynamic span vocabulary by segmenting the retrieved text flexibly, ensuring a DAG-structured state space, which allows GFlowNets to explore diverse compositional paths and improve generalization. With specialized reward models, FoSS generates diverse, high-quality text. Empirically, FoSS improves MAUVE scores by up to 12.5% over Transformer on text generation and achieves 3.5% gains on knowledge-intensive tasks, consistently outperforming state-of-the-art methods. Scaling experiments further demonstrate FoSS benefits from larger models, more data, and richer retrieval corpora, retaining its advantage over strong baselines.
https://arxiv.org/abs/2602.10583
Symbolic regression aims to distill mathematical equations from observational data. Recent approaches have successfully leveraged Large Language Models (LLMs) to generate equation hypotheses, capitalizing on their vast pre-trained scientific priors. However, existing frameworks predominantly treat the LLM as a static generator, relying on prompt-level guidance to steer exploration. This paradigm fails to update the model's internal representations based on search feedback, often yielding physically inconsistent or mathematically redundant expressions. In this work, we propose PiT-PO (Physics-informed Token-regularized Policy Optimization), a unified framework that evolves the LLM into an adaptive generator via reinforcement learning. Central to PiT-PO is a dual-constraint mechanism that rigorously enforces hierarchical physical validity while simultaneously applying fine-grained, token-level penalties to suppress redundant structures. Consequently, PiT-PO aligns the LLM to produce equations that are both scientifically consistent and structurally parsimonious. Empirically, PiT-PO achieves state-of-the-art performance on standard benchmarks and successfully discovers novel turbulence models for challenging fluid dynamics problems. We also demonstrate that PiT-PO empowers small-scale models to outperform closed-source giants, democratizing access to high-performance scientific discovery.
https://arxiv.org/abs/2602.10576