Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich, multi-level textual representations; (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. With the lightweight GPT-4o-mini model, our framework achieves SOTA performance against 15+ MLLMs on the English image implication benchmark and a substantial improvement on the Chinese benchmark, performing comparably to GPT-4o on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
https://arxiv.org/abs/2505.17019
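To make LAD's three-stage pipeline (Perception, Search, Reasoning) concrete, here is a minimal sketch of how the stages could be orchestrated. The helpers `call_vlm`, `call_llm`, and `web_search` are hypothetical placeholders standing in for a multimodal model, a text-only LLM, and a cross-domain retriever; this illustrates the control flow only, not the authors' implementation.

```python
# Hypothetical sketch of LAD's perceive -> search -> reason loop.
# `call_vlm`, `call_llm`, and `web_search` are assumed placeholder helpers.

def call_llm(prompt: str) -> str:
    """Placeholder for a text-only LLM call (e.g., a lightweight chat model)."""
    raise NotImplementedError

def call_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a multimodal model that describes an image."""
    raise NotImplementedError

def web_search(query: str, k: int = 3) -> list[str]:
    """Placeholder for a cross-domain knowledge retriever."""
    raise NotImplementedError

def understand_implication(image_path: str, max_rounds: int = 3) -> str:
    # (1) Perception: convert the image into multi-level textual descriptions.
    surface = call_vlm(image_path, "List the salient objects, text, and actions.")
    holistic = call_vlm(image_path, "Describe the scene, mood, and style in detail.")

    # (2) Search: iteratively retrieve knowledge that resolves ambiguous elements.
    context = []
    for _ in range(max_rounds):
        query = call_llm(
            f"Descriptions:\n{surface}\n{holistic}\nKnown context:\n{context}\n"
            "What single cultural or factual question, if answered, would most "
            "help explain this image? Reply with the question only, or DONE."
        )
        if query.strip() == "DONE":
            break
        context.extend(web_search(query))

    # (3) Reasoning: produce the implication with explicit step-by-step reasoning.
    return call_llm(
        f"Descriptions:\n{surface}\n{holistic}\nRetrieved context:\n{context}\n"
        "Reason step by step about what this image implies, then state the implication."
    )
```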
Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the model's internal knowledge. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT cold-start phase for preliminary format learning, followed by RL for dynamic knowledge acquisition. The RL stage uses outcome supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at this https URL.
https://arxiv.org/abs/2505.17005
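One way to picture R1-Searcher++'s outcome supervision combined with a reward for internal-knowledge utilization is the sketch below. The exact reward shaping is not specified in the abstract, so the bonus term, its decay with the number of search calls, and the weighting are illustrative assumptions rather than the paper's formulation.

```python
def answer_reward(prediction: str, gold: str) -> float:
    """Outcome supervision: 1 if the final answer matches the reference, else 0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def rollout_reward(prediction: str, gold: str, num_search_calls: int,
                   internal_bonus: float = 0.2) -> float:
    """Correct answers earn an extra bonus when fewer external searches were used,
    nudging the policy toward answering from internal knowledge when it can."""
    outcome = answer_reward(prediction, gold)
    bonus = internal_bonus / (1 + num_search_calls) if outcome > 0 else 0.0
    return outcome + bonus
```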
We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional, discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% of observations available, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at this https URL.
https://arxiv.org/abs/2505.17004
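FunDPS's plug-and-play guidance can be pictured as a standard diffusion-posterior-sampling update: estimate the clean solution with Tweedie's formula, then move the noisy iterate along the gradient of a data-consistency loss on the sparse observations. The PyTorch sketch below is a finite-dimensional, generic version under assumed interfaces for `score_model` and `forward_op`; it is not the function-space FunDPS implementation.

```python
import torch

def guided_denoise_step(x_t, t, score_model, forward_op, y_obs, sigma_t, step_size):
    """One DPS-style guidance step: Tweedie estimate of x_0, then a gradient
    correction so the estimate better matches the sparse observations y_obs."""
    x_t = x_t.detach().requires_grad_(True)

    # Tweedie's formula (finite-dimensional form): E[x_0 | x_t] = x_t + sigma_t^2 * score(x_t, t)
    x0_hat = x_t + (sigma_t ** 2) * score_model(x_t, t)

    # Data-consistency loss on the observed (sparse) measurements.
    loss = torch.sum((forward_op(x0_hat) - y_obs) ** 2)

    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - step_size * grad).detach(), x0_hat.detach()
```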
Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, and enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.
https://arxiv.org/abs/2505.16982
Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual rationale for objects in a coarse-to-fine manner. Based on these reasoning steps, we compose detailed description prompts and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at this https URL.
https://arxiv.org/abs/2505.16974
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to extremely compress multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.
https://arxiv.org/abs/2505.16956
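As an example of one of the combined techniques, vocabulary trimming can be as simple as keeping only the tokens that actually occur in the target-language corpus and re-indexing the embedding matrix accordingly. The sketch below (function name and special-token list are illustrative assumptions) shows that step in isolation, separate from the distillation, pruning, and truncation stages.

```python
import numpy as np

def trim_vocabulary(embedding_matrix: np.ndarray,
                    vocab: dict[str, int],
                    corpus_tokens: set[str],
                    special_tokens: tuple[str, ...] = ("[PAD]", "[UNK]", "[CLS]", "[SEP]")):
    """Keep only tokens observed in the target-language corpus (plus specials)
    and rebuild a compact embedding matrix for the trimmed monolingual model."""
    keep = [tok for tok in vocab if tok in corpus_tokens or tok in special_tokens]
    new_vocab = {tok: i for i, tok in enumerate(keep)}
    old_ids = np.array([vocab[tok] for tok in keep])
    new_matrix = embedding_matrix[old_ids]   # rows re-indexed to the trimmed vocabulary
    return new_matrix, new_vocab
```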
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
https://arxiv.org/abs/2505.16938
Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
https://arxiv.org/abs/2505.16931
Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs' ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE spans five domains and comprises 4k long-form QA instances and over 20k short-form QA pairs. Our dataset is the first to directly bridge short- and long-form QA with paired questions and gold-standard answers. Along with the benchmark, we propose a suite of new metrics to assess the models' capabilities to selectively express uncertainty. Using UNCLE, we then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models' performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.
https://arxiv.org/abs/2505.16922
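A simple way to quantify whether a model "selectively" expresses uncertainty is to check how often its hedges land on wrong answers. The snippet below computes two such diagnostics; it is an illustrative stand-in under assumed record fields (`correct`, `hedged`), not UNCLE's official metric suite.

```python
def selective_uncertainty_scores(records):
    """records: list of dicts with boolean fields `correct` (answer right?) and
    `hedged` (did the model express uncertainty?)."""
    hedged_wrong = sum(1 for r in records if r["hedged"] and not r["correct"])
    hedged_total = sum(1 for r in records if r["hedged"])
    wrong_total = sum(1 for r in records if not r["correct"])
    precision = hedged_wrong / hedged_total if hedged_total else 0.0  # hedges placed on errors
    recall = hedged_wrong / wrong_total if wrong_total else 0.0       # errors that got hedged
    return {"uncertainty_precision": precision, "uncertainty_recall": recall}
```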
Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but lack the structural knowledge essential for many biological applications. To address this, we integrate structural insights from pre-trained protein graph neural networks (pGNNs) into pLMs through a latent-level contrastive learning task. This task aligns residue representations from pLMs with those from pGNNs across multiple proteins, enriching pLMs with inter-protein structural knowledge. Additionally, we incorporate a physical-level task that infuses intra-protein structural knowledge by optimizing pLMs to predict structural tokens. The proposed dual-task framework effectively incorporates both inter-protein and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module, which uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method to the state-of-the-art ESM2 and AMPLIFY results in notable performance gains across a wide range of tasks, including a 12.7% increase in ESM2 contact prediction. The data, code, and resulting SaESM2 and SaAMPLIFY models will be released on Hugging Face.
https://arxiv.org/abs/2505.16896
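The latent-level contrastive task can be read as an InfoNCE objective over paired residue embeddings: each residue's pLM vector should be most similar to the pGNN vector of the same residue, with the other residues in the batch (possibly from other proteins) serving as negatives. Below is a generic PyTorch sketch of such an alignment loss, not the authors' exact formulation; the temperature and symmetric form are common defaults, assumed here.

```python
import torch
import torch.nn.functional as F

def residue_contrastive_loss(plm_repr: torch.Tensor,
                             pgnn_repr: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style alignment between residue embeddings from a protein language
    model and a protein GNN; row i of both inputs refers to the same residue."""
    z1 = F.normalize(plm_repr, dim=-1)   # (N, d) residues pooled across proteins
    z2 = F.normalize(pgnn_repr, dim=-1)  # (N, d)
    logits = z1 @ z2.T / temperature     # (N, N) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric cross-entropy: align pLM -> pGNN and pGNN -> pLM.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```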
Shared autonomy is an enabling technology that provides users with control authority over robots that would otherwise be difficult, if not impossible, to control directly. Yet, standard methods make assumptions that limit their adoption in practice: for example, prior knowledge of the user's goals or the objective (i.e., reward) function that they wish to optimize, knowledge of the user's policy, or query-level access to the user during training. Diffusion-based approaches to shared autonomy do not make such assumptions and instead only require access to demonstrations of desired behaviors, while allowing the user to maintain control authority. However, these advantages have come at the expense of high computational complexity, which has made real-time shared autonomy all but impossible. To overcome this limitation, we propose Consistency Shared Autonomy (CSA), a shared autonomy framework that employs a consistency model-based formulation of diffusion. Key to CSA is that it employs the distilled probability flow ordinary differential equation (PF ODE) to generate high-fidelity samples in a single step. This results in inference speeds significantly faster than what is possible with previous diffusion-based approaches to shared autonomy, enabling real-time assistance in complex domains with only a single function evaluation. Further, by intervening on flawed actions at intermediate states of the PF ODE, CSA enables varying levels of assistance. We evaluate CSA on a variety of challenging simulated and real-world robot control problems, demonstrating significant improvements over state-of-the-art methods both in terms of task performance and computational efficiency.
https://arxiv.org/abs/2505.16892
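The core idea of single-step, consistency-based assistance can be sketched as: diffuse the user's action to a noise level set by the desired assistance strength, then map it back to the demonstration manifold with one consistency-function evaluation. The code below assumes a consistency model with an `f(x_t, t) -> x_0` interface and an illustrative noise schedule; names and constants are assumptions, not the paper's implementation.

```python
import torch

@torch.no_grad()
def assist_action(user_action: torch.Tensor,
                  consistency_model,          # assumed interface: f(x_t, t) -> x_0 in one call
                  assistance_level: float,    # 0 = no help, 1 = full regeneration
                  sigma_max: float = 80.0) -> torch.Tensor:
    """Diffuse the user's action to an intermediate noise level, then denoise it
    with a single consistency-model evaluation (no iterative sampling)."""
    sigma = assistance_level * sigma_max
    noisy = user_action + sigma * torch.randn_like(user_action)
    t = torch.full(user_action.shape[:1], sigma, device=user_action.device)
    return consistency_model(noisy, t)   # one forward pass per control step
```

Larger `assistance_level` values push the action further toward the learned behavior distribution, which mirrors the varying levels of assistance described above.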
Uncertainty quantification in Knowledge Graph Embedding (KGE) methods is crucial for ensuring the reliability of downstream applications. A recent work applies conformal prediction to KGE methods, providing uncertainty estimates by generating a set of answers that is guaranteed to include the true answer with a predefined confidence level. However, existing methods provide probabilistic guarantees averaged over a reference set of queries and answers (marginal coverage guarantee). In high-stakes applications such as medical diagnosis, a stronger guarantee is often required: the predicted sets must provide consistent coverage per query (conditional coverage guarantee). We propose CondKGCP, a novel method that approximates predicate-conditional coverage guarantees while maintaining compact prediction sets. CondKGCP merges predicates with similar vector representations and augments calibration with rank information. We prove the theoretical guarantees and demonstrate empirical effectiveness of CondKGCP by comprehensive evaluations.
https://arxiv.org/abs/2505.16877
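For intuition, standard split conformal calibration computed separately within each predicate group looks like the sketch below; this is the kind of conditional guarantee CondKGCP approximates. The paper additionally merges predicates with similar vector representations and augments calibration with rank information, both of which are omitted in this simplified illustration.

```python
import numpy as np

def predicate_conditional_thresholds(scores: np.ndarray,
                                     predicates: np.ndarray,
                                     alpha: float = 0.1) -> dict:
    """Per-predicate split-conformal thresholds. `scores` are nonconformity scores
    of the true answers on a calibration set; `predicates` gives each query's predicate."""
    thresholds = {}
    for p in np.unique(predicates):
        s = np.sort(scores[predicates == p])
        n = len(s)
        q_idx = int(np.ceil((n + 1) * (1 - alpha)))   # standard conformal quantile index
        thresholds[p] = s[q_idx - 1] if q_idx <= n else np.inf
    return thresholds

def prediction_set(candidate_scores: np.ndarray, predicate, thresholds: dict) -> np.ndarray:
    """Indices of candidate answers whose nonconformity score falls below the
    calibrated threshold for this query's predicate."""
    return np.flatnonzero(candidate_scores <= thresholds[predicate])
```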
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.
https://arxiv.org/abs/2505.16875
This paper addresses the challenge of graph domain adaptation on evolving, multiple out-of-distribution (OOD) graphs. Conventional graph domain adaptation methods are confined to single-step adaptation, making them ineffective in handling continuous domain shifts and prone to catastrophic forgetting. This paper introduces the Graph Continual Adaptive Learning (GCAL) method, designed to enhance model sustainability and adaptability across various graph domains. GCAL employs a bilevel optimization strategy. The "adapt" phase uses an information maximization approach to fine-tune the model with new graph domains while re-adapting past memories to mitigate forgetting. Concurrently, the "generate memory" phase, guided by a theoretical lower bound derived from information bottleneck theory, involves a variational memory graph generation module to condense original graphs into memories. Extensive experimental evaluations demonstrate that GCAL substantially outperforms existing methods in terms of adaptability and knowledge retention.
https://arxiv.org/abs/2505.16860
GUI automation faces critical challenges in dynamic environments. MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: (1) Autonomous Exploration of Function-aware Trajectories. To comprehensively cover all application functionalities, we design a Function-aware Task Goal Generator that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies). This enables systematic exploration to collect diverse trajectories. (2) Unsupervised Mining of Transition-aware Knowledge. To establish precise screen-operation logic, we develop a Transition-aware Knowledge Extractor that extracts effective screen-operation logic through unsupervised analysis of the state transitions in structured interaction triples (observation, action, outcome). This eliminates the need for human involvement in knowledge extraction. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents. It requires no parameter updates for new apps. GUI-explorer is open-sourced and publicly available at this https URL.
https://arxiv.org/abs/2505.16827
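The transition-aware knowledge mining can be thought of as aggregating (observation, action, outcome) triples into a lookup table of expected screen effects. A toy, dependency-free sketch follows; the screen and widget names are made up, and real observations would be hashable summaries (e.g., activity name plus widget id) rather than raw screenshots.

```python
from collections import defaultdict

def mine_transition_knowledge(trajectories):
    """Distill (observation, action, outcome) triples into screen-operation knowledge:
    for each (state, action) pair, keep the most frequently observed outcome."""
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:                 # traj: list of (obs, action, outcome) triples
        for obs, action, outcome in traj:
            counts[(obs, action)][outcome] += 1
    return {key: max(outcomes, key=outcomes.get) for key, outcomes in counts.items()}

# Example: look up what tapping a (hypothetical) settings button on the home screen does.
effects = mine_transition_knowledge([
    [("HomeScreen", "tap:settings_btn", "SettingsScreen"),
     ("SettingsScreen", "tap:back", "HomeScreen")],
])
print(effects[("HomeScreen", "tap:settings_btn")])   # -> "SettingsScreen"
```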
Existing methods for multimodal MRI segmentation with missing modalities typically assume that all MRI modalities are available during training. However, in clinical practice, some modalities may be missing due to the sequential nature of MRI acquisition, leading to performance degradation. Furthermore, retraining models to accommodate newly available modalities can be inefficient and may cause overfitting, potentially compromising previously learned knowledge. To address these challenges, we propose Replay-based Hypergraph Domain Incremental Learning (ReHyDIL) for brain tumor segmentation with missing modalities. ReHyDIL leverages Domain Incremental Learning (DIL) to enable the segmentation model to learn from newly acquired MRI modalities without forgetting previously learned information. To enhance segmentation performance across diverse patient scenarios, we introduce the Cross-Patient Hypergraph Segmentation Network (CHSNet), which utilizes hypergraphs to capture high-order associations between patients. Additionally, we incorporate Tversky-Aware Contrastive (TAC) loss to effectively mitigate information imbalance both across and within different modalities. Extensive experiments on the BraTS2019 dataset demonstrate that ReHyDIL outperforms state-of-the-art methods, achieving an improvement of over 2% in the Dice Similarity Coefficient across various tumor regions. Our code is available at ReHyDIL.
https://arxiv.org/abs/2505.16809
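The Tversky index underlying the TAC loss weights false positives and false negatives asymmetrically, which is what helps with the imbalance mentioned above. Below is a plain Tversky segmentation loss in PyTorch as a reference for that building block; the contrastive, cross-modality part of TAC and the default weights are not taken from the paper.

```python
import torch

def tversky_loss(probs: torch.Tensor, target: torch.Tensor,
                 alpha: float = 0.7, beta: float = 0.3, eps: float = 1e-6) -> torch.Tensor:
    """Tversky index-based loss: alpha penalizes false positives, beta penalizes
    false negatives. `probs` and `target` are expected in [0, 1] with matching shapes."""
    tp = (probs * target).sum()
    fp = (probs * (1 - target)).sum()
    fn = ((1 - probs) * target).sum()
    tversky_index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - tversky_index
```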
Large language models (LLMs) encounter difficulties in knowledge-intensive multi-step reasoning (KIMSR) tasks. One challenge is how to effectively extract and represent rationale evidence. The current methods often extract semantically relevant but logically irrelevant evidence, resulting in flawed reasoning and inaccurate responses. We propose a two-way evidence self-alignment (TW-ESA) module, which utilizes the mutual alignment between strict reasoning and LLM reasoning to enhance its understanding of the causal logic of evidence, thereby addressing the first challenge. Another challenge is how to utilize the rationale evidence and LLM's intrinsic knowledge for accurate reasoning when the evidence contains uncertainty. We propose a dual-gated reasoning enhancement (DGR) module to gradually fuse useful knowledge of LLM within strict reasoning, which can enable the model to perform accurate reasoning by focusing on causal elements in the evidence and exhibit greater robustness. The two modules are collaboratively trained in a unified framework ESA-DGR. Extensive experiments on three diverse and challenging KIMSR datasets reveal that ESA-DGR significantly surpasses state-of-the-art LLM-based fine-tuning methods, with remarkable average improvements of 4% in exact match (EM) and 5% in F1 score. The implementation code is available at this https URL.
https://arxiv.org/abs/2505.16806
The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and real-time decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.
https://arxiv.org/abs/2505.16805
Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework designed to rephrase a simple user prompt into a sophisticated prompt for a text-to-image model. Specifically, we employ large vision-language models (LVLMs) as the solver to rewrite the user prompt, and concurrently employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by generating a solution and then judging it. Results on two popular datasets demonstrate that our method outperforms other strong competitors.
https://arxiv.org/abs/2505.16763
As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 504 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at this https URL.
https://arxiv.org/abs/2505.16722