Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess $\textit{Crystallized Intelligence}$, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks $\textit{Generative Fluid Intelligence (GFI)}$: the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce $\textbf{GENIUS}$ ($\textbf{GEN}$erative Fluid $\textbf{I}$ntelligence Eval$\textbf{U}$ation $\textbf{S}$uite). We formalize $\textit{GFI}$ as a synthesis of three primitives: $\textit{Inducing Implicit Patterns}$ (e.g., inferring personalized visual preferences), $\textit{Executing Ad-hoc Constraints}$ (e.g., visualizing abstract metaphors), and $\textit{Adapting to Contextual Knowledge}$ (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits on these tasks. Crucially, our diagnostic analysis disentangles these failure modes, demonstrating that the deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, $\textbf{GENIUS}$ establishes a rigorous standard for $\textit{GFI}$, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: $\href{this https URL}{this https URL}$.
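The attention intervention itself is not detailed in the abstract. As a hedged illustration, one common training-free form is to bias attention logits toward context tokens before the softmax, so that generation conditions more strongly on the immediate context; the `alpha` parameter and mask layout below are illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reweight_attention(logits, context_mask, alpha=2.0):
    """Training-free intervention: add log(alpha) to the attention logits of
    context tokens, multiplying their unnormalized weight by alpha
    (alpha > 1 amplifies context; alpha = 1 is a no-op)."""
    biased = logits + np.log(alpha) * context_mask
    return softmax(biased)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 8))                       # one query over 8 keys
context_mask = np.array([[1, 1, 1, 0, 0, 0, 0, 0]], dtype=float)

base = softmax(logits)
steered = reweight_attention(logits, context_mask)
# Attention mass on the three context tokens strictly increases.
```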
https://arxiv.org/abs/2602.11144
Misinformation detection is a critical task that can benefit significantly from the integration of external knowledge, much like manual fact-checking. In this work, we propose a novel method for representing textual documents that facilitates the incorporation of information from a knowledge base. Our approach, Text Encoding with Graph (TEG), processes documents by extracting structured information in the form of a graph and encoding both the text and the graph for classification purposes. Through extensive experiments, we demonstrate that this hybrid representation enhances misinformation detection performance compared to using language models alone. Furthermore, we introduce TEGRA, an extension of our framework that integrates domain-specific knowledge, further enhancing classification accuracy in most cases.
https://arxiv.org/abs/2602.11106
Agentic coding requires agents to effectively interact with runtime environments, e.g., command-line interfaces (CLIs), to complete tasks like resolving dependency issues, fixing system problems, etc. However, how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities remains underexplored. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packing the buggy state together with the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves a substantial absolute improvement of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.
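The derivation step can be sketched abstractly: replay a healthy environment's setup history, keep every prefix whose state fails a health check, and pack that buggy state with its error message as a task. The toy step list and `check` function below are hypothetical stand-ins; the actual CLI-Gym pipeline executes real Dockerfile layers and uses execution feedback.

```python
def derive_tasks(setup_steps, check):
    """Derive environment-repair tasks by 'inverting' a healthy environment:
    each prefix of the setup history that fails the health check yields a
    (buggy state, error message) task, with the dropped steps as the fix."""
    tasks = []
    for i in range(len(setup_steps)):
        state = setup_steps[:i + 1]        # environment after the first i+1 steps
        ok, error = check(state)
        if not ok:
            tasks.append({"state": state, "error": error,
                          "missing": setup_steps[i + 1:]})
    return tasks

# Toy "environment": an ordered step list; the final app needs libfoo first.
steps = ["apt-get install libfoo", "pip install app", "app --init"]

def check(state):
    if "app --init" in state and "apt-get install libfoo" not in state:
        return False, "error: libfoo.so not found"
    if "app --init" not in state:
        return False, "error: app not initialised"
    return True, ""

tasks = derive_tasks(steps, check)   # two buggy prefixes become two tasks
```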
https://arxiv.org/abs/2602.10999
Knowledge Graph Completion (KGC) fundamentally hinges on the coherent fusion of pre-trained entity semantics with heterogeneous topological structures to facilitate robust relational reasoning. However, existing paradigms encounter a critical "structural resolution mismatch," failing to reconcile divergent representational demands across varying graph densities, which precipitates structural noise interference in dense clusters and catastrophic representation collapse in sparse regions. We present SynergyKGC, an adaptive framework that advances traditional neighbor aggregation to an active Cross-Modal Synergy Expert via relation-aware cross-attention and semantic-intent-driven gating. By coupling a density-dependent Identity Anchoring strategy with a Double-tower Coherent Consistency architecture, SynergyKGC effectively reconciles topological heterogeneity while ensuring representational stability across training and inference phases. Systematic evaluations on two public benchmarks validate the superiority of our method in significantly boosting KGC hit rates, providing empirical evidence for a generalized principle of resilient information integration in non-homogeneous structured data.
https://arxiv.org/abs/2602.10845
Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim that dense retrievers exhibit a broad preference for text generated by large language models (LLMs). This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned on in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference; the direction and magnitude of the preference depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe, which reattaches a language modeling head to the fine-tuned dense retriever encoder, indicates agreement with relevance near chance, weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.
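The paired human/LLM setup suggests a simple probe for source bias: score each query against its parallel human- and LLM-written documents and measure how often the LLM version ranks higher (0.5 would mean no preference). A minimal sketch with synthetic embeddings, not the paper's evaluation code:

```python
import numpy as np

def pro_llm_rate(q, human_docs, llm_docs):
    """Fraction of queries for which a dot-product retriever scores the
    LLM-written document above its human-written parallel counterpart."""
    s_h = (q * human_docs).sum(axis=1)   # query-document dot products
    s_l = (q * llm_docs).sum(axis=1)
    return float((s_l > s_h).mean())

rng = np.random.default_rng(1)
q = rng.normal(size=(100, 16))
human = rng.normal(size=(100, 16))
# Simulate biased LLM embeddings by nudging each doc toward its query.
llm = human + 0.3 * q / np.linalg.norm(q, axis=1, keepdims=True)

rate = pro_llm_rate(q, human, llm)       # 1.0: fully pro-LLM by construction
```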
https://arxiv.org/abs/2602.10833
Large Language Models (LLMs) have achieved remarkable success; however, the emergence of content generation distortion (hallucination) limits their practical applications. The core cause of hallucination lies in LLMs' lack of awareness regarding their stored internal knowledge, preventing them from expressing their knowledge state on questions beyond their internal knowledge boundaries, as humans do. However, existing research on knowledge boundary expression primarily focuses on white-box LLMs, leaving methods suitable for black-box LLMs, which offer only API access without revealing internal parameters, largely unexplored. Against this backdrop, this paper proposes LSCL (LLM-Supervised Confidence Learning), a deep learning-based method for expressing the knowledge boundaries of black-box LLMs. Built on the knowledge distillation framework, the method designs a deep learning model that takes the input question, output answer, and token probabilities from a black-box LLM as inputs and constructs a mapping between these inputs and the model's internal knowledge state, enabling the quantification and expression of the black-box LLM's knowledge boundaries. Experiments conducted on diverse public datasets and with multiple prominent black-box LLMs demonstrate that LSCL effectively assists black-box LLMs in accurately expressing their knowledge boundaries, significantly outperforming existing baseline models on metrics such as accuracy and recall. Furthermore, for scenarios where some black-box LLMs do not support access to token probabilities, an adaptive alternative method is proposed; its performance is close to that of LSCL and surpasses the baseline models.
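At its core, the described mapping takes black-box observables to a confidence score over the model's knowledge state. A minimal sketch with a hand-set logistic scorer; the feature set (mean token log-probability, answer length, a question-rarity proxy) and the weights are illustrative assumptions, whereas LSCL learns a deep model under teacher supervision.

```python
import math

def confidence(features, weights, bias):
    """Map black-box observables to a confidence in [0, 1] that the question
    lies inside the model's knowledge boundary. A logistic stand-in for the
    deep model that LSCL trains by distillation."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters: reward high token probability, penalize
# long answers and rare questions.
w, b = [2.0, -0.01, -0.5], 0.0
confident = confidence([-0.1, 12, 0.2], w, b)   # high mean token log-prob
uncertain = confidence([-2.5, 40, 0.9], w, b)   # low mean token log-prob
```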
https://arxiv.org/abs/2602.10801
Software vulnerability detection (SVD) is a critical challenge in modern systems. Large language models (LLMs) offer natural-language explanations alongside predictions, but most work focuses on binary evaluation, and explanations often lack semantic consistency with Common Weakness Enumeration (CWE) categories. We propose VulReaD, a knowledge-graph-guided approach for vulnerability reasoning and detection that moves beyond binary classification toward CWE-level reasoning. VulReaD leverages a security knowledge graph (KG) as a semantic backbone and uses a strong teacher LLM to generate CWE-consistent contrastive reasoning supervision, enabling student model training without manual annotations. Students are fine-tuned with Odds Ratio Preference Optimization (ORPO) to encourage taxonomy-aligned reasoning while suppressing unsupported explanations. Across three real-world datasets, VulReaD improves binary F1 by 8-10% and multi-class classification by 30% Macro-F1 and 18% Micro-F1 compared to state-of-the-art baselines. Results show that LLMs outperform deep learning baselines in binary detection and that KG-guided reasoning enhances CWE coverage and interpretability.
https://arxiv.org/abs/2602.10787
In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
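The stability-plasticity mechanism can be illustrated on a linear-attention memory: each write is scaled down by the accumulated importance of the slots it touches, so important memories resist overwriting while fresh slots stay plastic. The gates below are heuristic stand-ins for Palimpsa's Bayesian posterior quantities.

```python
import numpy as np

def metaplastic_update(S, importance, k, v, lam=0.9, eta=1.0):
    """One step of a gated linear-attention memory with metaplasticity: the
    write rate to each slot shrinks as its accumulated importance grows
    (stability), while unimportant slots remain plastic. Illustrative only."""
    plasticity = eta / (1.0 + importance)             # metaplastic gate
    S = lam * S + plasticity * np.outer(v, k)         # gated write
    importance = importance + np.abs(np.outer(v, k))  # accumulate importance
    return S, importance

d = 4
S, imp = np.zeros((d, d)), np.zeros((d, d))
rng = np.random.default_rng(2)
k1, v1 = rng.normal(size=d), rng.normal(size=d)

S, imp = metaplastic_update(S, imp, k1, v1)
first_write = np.linalg.norm(S)
# Repeating the same write changes the state less: the slots have hardened.
S2, _ = metaplastic_update(S, imp, k1, v1, lam=1.0)
second_delta = np.linalg.norm(S2 - S)
```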
https://arxiv.org/abs/2602.09075
Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model's domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.
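The staged modulation can be sketched as a scalar schedule on the output constraint: fully constrained in the early phase, then progressively released into free generation as domain familiarity grows. The linear decay and the warmup fraction are assumptions; the abstract does not specify RCPA's exact schedule.

```python
def constraint_strength(step, total_steps, warmup_frac=0.4):
    """Curriculum-aware modulation: 1.0 = fully constrained partial-output
    training, 0.0 = full free-generation optimization. Linear release after
    the warmup phase (an illustrative choice)."""
    warmup = warmup_frac * total_steps
    if step < warmup:
        return 1.0                                  # safe exposure phase
    frac = (step - warmup) / (total_steps - warmup)
    return max(0.0, 1.0 - frac)                     # progressive release

strengths = [constraint_strength(s, 100) for s in (0, 20, 40, 70, 100)]
```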
https://arxiv.org/abs/2602.10740
Due to recent advancements in Large Audio-Language Models (LALMs) that demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, there is a growing interest in proposing benchmarks to assess these models. Existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require external information grounding. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. This benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even the state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.
https://arxiv.org/abs/2602.10656
While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.
https://arxiv.org/abs/2602.10619
Standard autoregressive language models generate text token-by-token from a fixed vocabulary, inducing a tree-structured state space when token sampling is viewed as an action, which limits flexibility and expressiveness. Recent work introduces a dynamic vocabulary by sampling retrieved text spans but overlooks that the same sentence can be composed of spans of varying lengths, lacking explicit modeling of the directed acyclic graph (DAG) state space. This leads to restricted exploration of compositional paths and a bias toward the chosen path. Generative Flow Networks (GFlowNets) are powerful for efficiently exploring and generalizing over state spaces, particularly those with a DAG structure. However, prior GFlowNets-based language models operate at the token level and remain confined to tree-structured spaces, limiting their potential. In this work, we propose Flow of SpanS (FoSS), a principled GFlowNets framework for span generation. FoSS constructs a dynamic span vocabulary by segmenting the retrieved text flexibly, ensuring a DAG-structured state space, which allows GFlowNets to explore diverse compositional paths and improve generalization. With specialized reward models, FoSS generates diverse, high-quality text. Empirically, FoSS improves MAUVE scores by up to 12.5% over Transformer on text generation and achieves 3.5% gains on knowledge-intensive tasks, consistently outperforming state-of-the-art methods. Scaling experiments further demonstrate that FoSS benefits from larger models, more data, and richer retrieval corpora, retaining its advantage over strong baselines.
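The DAG structure arises because sentence positions are nodes and vocabulary spans are edges, so one sentence admits many segmentation paths. A short dynamic program counts them, which is exactly the multiplicity a token-level tree-structured model cannot represent (the toy span vocabulary is illustrative):

```python
def span_dag_paths(tokens, span_vocab):
    """Count distinct compositional paths through the DAG whose nodes are
    sentence positions and whose edges are spans from the vocabulary."""
    n = len(tokens)
    paths = [0] * (n + 1)
    paths[0] = 1                                  # empty prefix: one path
    for i in range(n):
        for j in range(i + 1, n + 1):
            if tuple(tokens[i:j]) in span_vocab:  # edge i -> j exists
                paths[j] += paths[i]
    return paths[n]

tokens = ["new", "york", "city", "hall"]
vocab = {("new",), ("york",), ("city",), ("hall",),
         ("new", "york"), ("new", "york", "city"), ("city", "hall")}
n_paths = span_dag_paths(tokens, vocab)   # 5 distinct segmentations
```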
https://arxiv.org/abs/2602.10583
A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success, delivering roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a shared language-action format that yields additional gains through co-training.
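The core idea, writing low-level actions as plain text so action supervision matches the VLM's input-output distribution, can be sketched with a round-trip encoder/decoder. The template below is a hypothetical example, not the paper's actual format.

```python
def action_to_text(dx, dy, dz, gripper):
    """Represent a low-level end-effector action directly in natural
    language (illustrative template; LAP's exact wording is not reproduced)."""
    grip = "close" if gripper else "open"
    return f"move x {dx:+.2f} y {dy:+.2f} z {dz:+.2f}, {grip} gripper"

def text_to_action(text):
    """Parse the language action back into numbers for the controller."""
    parts = text.split()
    dx = float(parts[2])
    dy = float(parts[4])
    dz = float(parts[6].rstrip(","))
    return dx, dy, dz, "close" in text

msg = action_to_text(0.05, -0.10, 0.00, gripper=True)
roundtrip = text_to_action(msg)   # language is lossless for the controller
```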
https://arxiv.org/abs/2602.10556
Recent advances in large language models (LLMs) have enabled the development of multimodal medical AI. While models such as MedGemini achieve high accuracy on VQA tasks like USMLE MM, their performance on ECG-based tasks remains limited, and some models, such as MedGemma, do not support ECG data at all. Interpreting ECGs is inherently challenging, and diagnostic accuracy can vary depending on the interpreter's experience. Although echocardiography provides rich diagnostic information, it requires specialized equipment and personnel, limiting its availability. In this study, we focus on constructing a robust ECG encoder for multimodal pretraining using real-world hospital data. We employ SigLIP, a CLIP-based model whose sigmoid-based loss function enables multi-label prediction, and introduce a modified loss function tailored to the multi-label nature of ECG data. Experiments demonstrate that incorporating medical knowledge in the language model and applying the modified loss significantly improve multi-label ECG classification. To further enhance performance, we increase the embedding dimensionality and apply random cropping to mitigate data drift. Finally, per-label analysis reveals which ECG findings are easier or harder to predict. Our study provides a foundational framework for developing medical models that utilize ECG data.
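A sigmoid loss suits ECG data because each (signal, finding) pair becomes an independent binary decision, so one recording can match several findings at once, which a softmax contrastive loss cannot express. A minimal SigLIP-style pairwise loss; the paper's modified multi-label weighting is not reproduced here.

```python
import numpy as np

def sigmoid_pairwise_loss(logits, labels, t=1.0, b=0.0):
    """SigLIP-style loss: mean of -log sigmoid(label * (t * logit + b)) over
    all (ECG, text) pairs, with labels +1 (match) / -1 (non-match). Each pair
    is independent, so multiple positives per ECG are allowed."""
    z = labels * (t * logits + b)
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-z)))))

logits = np.array([[4.0, 3.0, -4.0],
                   [-3.0, 4.0, 4.0]])
labels = np.array([[1, 1, -1],        # first ECG matches two findings at once
                   [-1, 1, 1]])
good = sigmoid_pairwise_loss(logits, labels)    # scores agree with labels
bad = sigmoid_pairwise_loss(logits, -labels)    # scores contradict labels
```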
https://arxiv.org/abs/2602.10553
Modern large-scale neural networks are often trained and released in multiple sizes to accommodate diverse inference budgets. To improve efficiency, recent work has explored model upscaling: initializing larger models from trained smaller ones in order to transfer knowledge and accelerate convergence. However, this method can be sensitive to hyperparameters that need to be tuned at the target upscaled model size, which is prohibitively costly to do directly. It remains unclear whether the most common workaround -- tuning on smaller models and extrapolating via hyperparameter scaling laws -- is still sound when using upscaling. We address this with principled approaches to upscaling with respect to model widths and efficiently tuning hyperparameters in this setting. First, motivated by $\mu$P and any-dimensional architectures, we introduce a general upscaling method applicable to a broad range of architectures and optimizers, backed by theory guaranteeing that models are equivalent to their widened versions and allowing for rigorous analysis of infinite-width limits. Second, we extend the theory of $\mu$Transfer to a hyperparameter transfer technique for models upscaled using our method and empirically demonstrate that this method is effective on realistic datasets and architectures.
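The guarantee that a model is "equivalent to its widened version" can be illustrated with the textbook function-preserving widening of a one-hidden-layer ReLU network: duplicate the hidden units and halve the duplicated outgoing weights. This is only the classic special case; the paper's method extends such equivalences across architectures and optimizers with $\mu$P-style scaling.

```python
import numpy as np

def widen_mlp(W1, W2):
    """Double the hidden width while preserving the network function:
    duplicate each hidden unit and halve its outgoing weights, so the two
    half-weight copies sum back to the original contribution."""
    W1_wide = np.concatenate([W1, W1], axis=0)        # duplicate hidden units
    W2_wide = np.concatenate([W2, W2], axis=1) / 2.0  # halve outgoing weights
    return W1_wide, W2_wide

def forward(x, W1, W2):
    h = np.maximum(W1 @ x, 0.0)   # ReLU hidden layer
    return W2 @ h

rng = np.random.default_rng(3)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
x = rng.normal(size=4)

W1w, W2w = widen_mlp(W1, W2)
y_small = forward(x, W1, W2)
y_wide = forward(x, W1w, W2w)     # identical output at double the width
```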
https://arxiv.org/abs/2602.10545
The integration of artificial intelligence (AI) into healthcare is accelerating, yet medical education has not kept pace with these technological advancements. This paper synthesizes current knowledge on AI in medical education through a comprehensive analysis of the literature, identifying key competencies, curricular approaches, and implementation strategies. The aim is to highlight the critical need for structured AI education across the medical learning continuum and to offer a framework for curriculum development. The findings suggest that effective AI education requires longitudinal integration throughout medical training, interdisciplinary collaboration, and balanced attention to both technical fundamentals and clinical applications. This paper serves as a foundation for medical educators seeking to prepare future physicians for an AI-enhanced healthcare environment.
https://arxiv.org/abs/2602.10527
Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.
https://arxiv.org/abs/2602.10518
Graph Domain Adaptation (GDA) aims to bridge distribution shifts between domains by transferring knowledge from well-labeled source graphs to given unlabeled target graphs. One promising recent approach addresses graph transfer by discretizing the adaptation process, typically through the construction of intermediate graphs or stepwise alignment procedures. However, such discrete strategies often fail in real-world scenarios, where graph structures evolve continuously and nonlinearly, making it difficult for fixed-step alignment to approximate the actual transformation process. To address these limitations, we propose \textbf{DiffGDA}, a \textbf{Diff}usion-based \textbf{GDA} method that models the domain adaptation process as a continuous-time generative process. We formulate the evolution from source to target graphs using stochastic differential equations (SDEs), enabling the joint modeling of structural and semantic transitions. To guide this evolution, a domain-aware network is introduced to steer the generative process toward the target domain, encouraging the diffusion trajectory to follow an optimal adaptation path. We theoretically show that the diffusion process converges to the optimal solution bridging the source and target domains in the latent space. Extensive experiments on 14 graph transfer tasks across 8 real-world datasets demonstrate DiffGDA consistently outperforms state-of-the-art baselines.
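The continuous-time formulation can be illustrated with an Euler-Maruyama simulation of an SDE whose drift steers latent representations toward the target domain. Here the drift is a fixed pull toward the target mean, standing in for DiffGDA's learned domain-aware network.

```python
import numpy as np

def guided_sde(x0, target, steps=200, dt=0.01, sigma=0.1, seed=0):
    """Euler-Maruyama simulation of dX = f(X) dt + sigma dW, where the drift
    f(X) = target - X guides the trajectory toward the target domain. A toy
    stand-in: DiffGDA learns the drift with a domain-aware network."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        drift = target - x                                 # guidance term
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=x.shape)
    return x

source = np.zeros(5)               # source-domain latent (toy)
target = np.full(5, 3.0)           # target-domain latent (toy)
x_end = guided_sde(source, target)
gap_start = np.linalg.norm(source - target)
gap_end = np.linalg.norm(x_end - target)   # trajectory approaches the target
```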
https://arxiv.org/abs/2602.10506
Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.
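The pseudo-detoxification failure mode (toxicity scores dropping because generation degenerates, not because unsafe content is suppressed) can be illustrated with a simple diagnostic. This is a hypothetical sketch, not the paper's framework: the distinct-bigram ratio and the 0.2 threshold are illustrative choices for detecting degenerate, repetitive output.

```python
def distinct_n(text, n=2):
    """Fraction of unique n-grams; values near zero signal degenerate repetition."""
    tokens = text.split()
    if len(tokens) < n:
        return 1.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def flag_pseudo_detox(toxicity_before, toxicity_after, response, threshold=0.2):
    """Flag cases where a toxicity drop coincides with degenerate generation."""
    toxicity_dropped = toxicity_after < toxicity_before
    degenerate = distinct_n(response) < threshold
    return toxicity_dropped and degenerate

fluent = "The model now politely declines and explains why the request is unsafe."
degenerate_out = "no no no no no no no no no no no no"
```

Here `flag_pseudo_detox(0.9, 0.1, degenerate_out)` fires because the classifier score fell while the output collapsed into repetition, whereas the fluent refusal passes; a robustness-oriented evaluation would report such flags alongside raw toxicity scores.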
https://arxiv.org/abs/2602.10504
Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs but is challenged by complex, multi-faceted distributional shifts. Existing methods attempt to reduce distributional shifts by aligning manually selected graph elements (e.g., node attributes or structural statistics), which typically require manually designed graph filters to extract relevant features before alignment. However, such approaches are inflexible: they rely on scenario-specific heuristics and struggle when dominant discrepancies vary across transfer scenarios. To address these limitations, we propose \textbf{ADAlign}, an Adaptive Distribution Alignment framework for GDA. Unlike heuristic methods, ADAlign requires no manual specification of alignment criteria. It automatically identifies the most relevant discrepancies in each transfer and aligns them jointly, capturing the interplay between attributes, structures, and their dependencies. This makes ADAlign flexible, scenario-aware, and robust to diverse and dynamically evolving shifts. To enable this adaptivity, we introduce the Neural Spectral Discrepancy (NSD), a theoretically principled parametric distance that provides a unified view of cross-graph shifts. NSD leverages a neural characteristic function in the spectral domain to encode feature-structure dependencies of all orders, while a learnable frequency sampler adaptively emphasizes the most informative spectral components for each task via a minimax paradigm. Extensive experiments on 10 datasets and 16 transfer tasks show that ADAlign not only outperforms state-of-the-art baselines but also achieves efficiency gains with lower memory usage and faster training.
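The characteristic-function view underlying NSD can be illustrated with a plain empirical version: compare E[exp(i t^T x)] between two feature distributions at sampled frequencies. This is a simplified sketch only; it uses fixed Gaussian frequencies rather than the paper's learnable minimax frequency sampler and neural parameterization, and operates on raw feature vectors rather than graph representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def ecf(X, freqs):
    """Empirical characteristic function E[exp(i t^T x)] at each frequency row t."""
    proj = X @ freqs.T                      # shape (n_samples, n_freqs)
    return np.exp(1j * proj).mean(axis=0)   # shape (n_freqs,)

def cf_discrepancy(X_src, X_tgt, n_freqs=64, scale=1.0):
    """Squared CF distance averaged over random Gaussian frequencies."""
    d = X_src.shape[1]
    freqs = rng.normal(scale=scale, size=(n_freqs, d))
    diff = ecf(X_src, freqs) - ecf(X_tgt, freqs)
    return float(np.mean(np.abs(diff) ** 2))

# Matched distributions give a near-zero discrepancy; a mean shift is detected.
same = cf_discrepancy(rng.normal(size=(2000, 4)), rng.normal(size=(2000, 4)))
shift = cf_discrepancy(rng.normal(size=(2000, 4)), rng.normal(loc=2.0, size=(2000, 4)))
```

Because the characteristic function determines the distribution, such a distance captures discrepancies of all orders; the adaptive part of NSD lies in learning which frequencies to emphasize, which this fixed sampler deliberately omits.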
https://arxiv.org/abs/2602.10489