Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. We also introduce two new metrics for evaluating motion more robustly and affordably and for guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified joint policy learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
从互联网视频中学习潜在运动对于构建通用型机器人至关重要。然而,现有的离散潜在动作方法存在信息损失的问题,并且难以处理复杂和细微的动态变化。为此我们提出了CoMo(Continuous Motion),旨在从多样化的、大规模的互联网视频中学习信息更丰富的连续运动表示。 CoMo采用了早期时间特征差分机制来防止模型崩溃并抑制静态外观噪声,从而有效抑制捷径学习。同时,遵循信息瓶颈原则,我们将潜在运动嵌入维度进行限制,以在保留足够的与动作相关的信息和最小化无关的外观噪声之间取得更好的平衡。 此外,我们还引入了两个新的评估指标,用于更加稳健且经济地评价运动并指导运动学习方法的发展:(i)动作预测线性探测MSE;(ii)过去到当前及未来到当前运动嵌入之间的余弦相似度。 最关键的是,CoMo展示出了强大的零样本泛化能力,使其能够为之前未见过的视频领域生成连续伪动作。这种能力使得利用从各种无动作标签视频数据集(例如跨具身视频,尤其是人类演示视频)中提取的伪动作进行统一策略联合学习成为可能,并可结合有限的带标签机器人数据进一步增强。 广泛的实验表明,与CoMo伪动作协同训练的策略在模拟和现实世界环境中使用扩散模型和自回归架构均表现出卓越性能。
https://arxiv.org/abs/2505.17006
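The second CoMo metric, the cosine similarity between past-to-current and future-to-current motion embeddings, can be illustrated with a small sketch. The abstract does not say how the embeddings are extracted or how the score is interpreted, so the vectors below are made-up numbers and the function name is hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical motion embeddings (illustrative values, not model outputs).
past_to_current = [0.8, -0.1, 0.3]
future_to_current = [-0.7, 0.2, -0.35]

score = cosine_similarity(past_to_current, future_to_current)
print(f"cosine similarity: {score:.3f}")
```

A score near 1 means the two embeddings point in the same direction, and a score near -1 means they point in opposite directions.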
Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at this https URL.
分布外(Out-of-distribution,OOD)检测和分割对于在自动驾驶和机器人辅助手术等安全关键应用中部署机器学习模型至关重要。尽管之前的大多数研究主要集中在单模态图像数据上,但现实世界的应用本质上是多模态的,需要整合多种模态以提高OOD检测的效果。一个关键挑战是没有来自未知数据的监督信号,导致模型在处理OOD样本时过于自信。为解决这一挑战,我们提出了特征混合(Feature Mixing)方法,这是一种有理论支持的、极其简单且快速的多模态离群样本合成方法,可以通过进一步优化帮助模型更好地区分分布内(in-distribution,ID)和OOD数据。特征混合与模态无关,并适用于各种模态组合。 此外,我们还介绍了CARLA-OOD,这是一个新颖的多模态数据集,用于OOD分割任务,其中包含在不同场景和天气条件下合成的OOD物体。在SemanticKITTI、nuScenes、CARLA-OOD以及MultiOOD基准测试上进行的大量实验表明,特征混合方法能够实现最先进的性能,并且速度提高了10倍到370倍。我们的源代码和数据集将在 this https URL 提供。
https://arxiv.org/abs/2505.16985
Foundation models hold significant promise in healthcare, given their capacity to extract meaningful representations independent of downstream tasks. This property has enabled state-of-the-art performance across several clinical applications trained on structured electronic health record (EHR) data, even in settings with limited labeled data, a prevalent challenge in healthcare. However, there is little consensus on these models' potential for clinical utility due to the lack of desiderata of comprehensive and meaningful tasks and sufficiently diverse evaluations to characterize the benefit over conventional supervised learning. To address this gap, we propose a suite of clinically meaningful tasks spanning patient outcomes, early prediction of acute and chronic conditions, including desiderata for robust evaluations. We evaluate state-of-the-art foundation models on EHR data consisting of 5 million patients from Columbia University Irving Medical Center (CUMC), a large urban academic medical center in New York City, across 14 clinically relevant tasks. We measure overall accuracy, calibration, and subpopulation performance to surface tradeoffs based on the choice of pre-training, tokenization, and data representation strategies. Our study aims to advance the empirical evaluation of structured EHR foundation models and guide the development of future healthcare foundation models.
基础模型在医疗保健领域展现出巨大的潜力,这是因为它们能够提取出与具体下游任务无关的有意义表示。这种特性使得这些模型即使在标注数据有限的情况下(这是医疗保健领域的常见挑战),也能在基于结构化电子健康记录(EHR)数据训练的多个临床应用中实现最先进的性能表现。然而,由于缺乏全面且有意义的任务标准以及足够多样化的评估方法来表征其相对于传统监督学习的优势,这些模型在临床实用性方面的潜力仍然存在争议。 为了弥补这一差距,我们提出了一套涵盖患者结果、急性病和慢性疾病的早期预测等具有临床意义的任务,并制定了稳健评价的标准。我们在哥伦比亚大学欧文医学中心(CUMC)提供的包含500万患者的EHR数据集上对最先进基础模型进行了评估,该数据来自纽约市的一个大型城市学术医疗中心。我们针对14个相关的临床任务进行了测试,测量了整体准确性、校准性和不同亚群体的表现,以揭示基于预训练策略、标记化和数据表示方法选择的权衡。 我们的研究旨在推进结构化EHR基础模型的经验评估,并为未来健康保健领域基础模型的发展提供指导。
https://arxiv.org/abs/2505.16941
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
人工智能(AI)正在加速科研范式的转变,不仅提升了研究效率,还推动了创新。我们推出了NovelSeek,这是一个统一的闭环多智能体框架,用于在多个科学领域中进行自主科学研究(ASR),使研究人员能够以前所未有的速度和精度解决这些领域的复杂问题。NovelSeek突出三大优势: 1. **可扩展性**:NovelSeek已在12项科研任务中展示了其适应能力,能够在多种基线代码的性能提升方面生成创新想法。 2. **交互性**:NovelSeek提供了一个接口,支持人类专家反馈和多智能体互动,在自动化端到端过程中能够无缝集成领域专业知识。 3. **效率**:相比人工努力,NovelSeek在多个科学领域中实现了显著的时间成本节约,并取得了令人瞩目的性能提升。例如,在反应产率预测方面,其性能从27.6%提升至35.4%,仅耗时12小时;在增强子活性预测上,准确度从0.52升至0.79,仅需4小时的处理时间;而在二维语义分割领域,精度提升了近三个百分点,在短短30小时内由78.8%提高到81.0%。
https://arxiv.org/abs/2505.16938
Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but lack the structural knowledge essential for many biological applications. To address this, we integrate structural insights from pre-trained protein graph neural networks (pGNNs) into pLMs through a latent-level contrastive learning task. This task aligns residue representations from pLMs with those from pGNNs across multiple proteins, enriching pLMs with inter-protein structural knowledge. Additionally, we incorporate a physical-level task that infuses intra-protein structural knowledge by optimizing pLMs to predict structural tokens. The proposed dual-task framework effectively incorporates both inter-protein and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module, which uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method to the state-of-the-art ESM2 and AMPLIFY results in notable performance gains across a wide range of tasks, including a 12.7% increase in ESM2 contact prediction. The data, code, and resulting SaESM2 and SaAMPLIFY models will be released on Hugging Face.
蛋白质语言模型(pLMs)在庞大的蛋白质序列数据库上预训练后,在各种下游任务中表现出色,但缺乏许多生物学应用所需的重要结构知识。为了弥补这一不足,我们将来自预训练的蛋白质图神经网络(pGNNs)的结构洞察融入到pLMs中,通过一个潜在层次对比学习的任务实现这一点。此任务使来自pLMs和pGNNs的残基表示在多个蛋白质之间对齐,从而丰富了pLMs的跨蛋白结构知识。此外,我们还引入了一个物理层面的任务,通过优化pLM预测结构标记的能力来注入同蛋白内的结构知识。提出的双任务框架有效将跨蛋白与同蛋白内结构知识融入到pLMs中。 鉴于PDB中的蛋白质结构质量存在差异性,我们进一步引入了一种残基损失选择模块,该模块使用在高质量结构上训练的小型模型为pLM挑选可靠且具有挑战性的残基损失。应用我们的结构对齐方法至最先进的ESM2和AMPLIFY模型,在包括ESM2接触预测在内的广泛任务中实现了显著的性能提升(提高了12.7%)。数据、代码以及生成的SaESM2和SaAMPLIFY模型将在Hugging Face上发布。
https://arxiv.org/abs/2505.16896
Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
幻觉——尽管合理但错误的输出——依然是大规模语言模型(LLM)可靠部署的关键障碍。我们首次系统研究了由增量上下文注入引起的内部状态漂移与幻觉发生率之间的联系。使用TruthfulQA,我们在每个问题上构建两个16轮“滴定”轨道:一个附加相关但部分有缺陷的片段,另一个则注入故意误导的内容。在六个开源LLM中,我们利用三视角检测器跟踪显性幻觉率,并通过余弦、熵、JS和斯皮尔曼漂移分析隐藏状态及注意力图的变化来追踪隐性动态变化。研究结果揭示了以下几点: 1. 幻觉频率与表示漂移随轮次增加而单调增长,在5-7轮后达到平台期。 2. 相关上下文驱动语义深入吸收,产生高置信度的“自我一致”幻觉;而不相关上下文则通过注意力重新定向导致主题漂错。 3. JS漂移(约0.69)与斯皮尔曼漂移(接近于零)的收敛标志着一个“注意力锁定”的阈值,在此之后,幻觉固化并变得难以纠正。 相关性分析揭示了吸收能力和注意力扩散之间的跷跷板效应,澄清了大小依赖型错误模式。这些发现为内在幻觉预测和上下文感知缓解机制提供了实证基础。
https://arxiv.org/abs/2505.16894
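The abstract above does not define JS-Drift. Assuming it is a Jensen-Shannon divergence computed with natural logarithms between two distributions (for example, attention maps before and after context injection), the reported plateau near 0.69 would sit close to that divergence's upper bound, $\ln 2 \approx 0.693$. The sketch below only checks the bound; the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats between two discrete distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Two distributions with disjoint support reach the upper bound ln 2.
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
print(disjoint, math.log(2))

# Identical distributions give zero divergence.
print(js_divergence([0.25, 0.75], [0.25, 0.75]))
```

Under this reading, a JS value near 0.69 would mean the two distributions barely overlap. Whether the paper uses nats and this exact definition is an assumption here.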
With the growing success of reasoning models across complex natural language tasks, researchers in the Information Retrieval (IR) community have begun exploring how similar reasoning capabilities can be integrated into passage rerankers built on Large Language Models (LLMs). These methods typically employ an LLM to produce an explicit, step-by-step reasoning process before arriving at a final relevance prediction. But does reasoning actually improve reranking accuracy? In this paper, we dive deeper into this question, studying the impact of the reasoning process by comparing reasoning-based pointwise rerankers (ReasonRR) to standard, non-reasoning pointwise rerankers (StandardRR) under identical training conditions, and observe that StandardRR generally outperforms ReasonRR. Building on this observation, we then study the importance of reasoning to ReasonRR by disabling its reasoning process (ReasonRR-NoReason), and find that ReasonRR-NoReason is surprisingly more effective than ReasonRR. Examining the cause of this result, we find that reasoning-based rerankers are limited by the LLM's reasoning process, which pushes it toward polarized relevance scores and thus fails to consider the partial relevance of passages, a key factor for the accuracy of pointwise rerankers.
随着在复杂自然语言任务中推理模型的成功日益显著,信息检索(IR)领域的研究人员已经开始探索如何将类似的推理能力整合到基于大型语言模型(LLMs)的段落重排序器中。这些方法通常利用LLM生成一个明确、逐步的推理过程,然后得出最终的相关性预测。然而,这种推理实际上是否能提高重排序精度呢?在本文中,我们深入探讨了这个问题,通过在相同的训练条件下比较基于推理的点对点重排序器(ReasonRR)与标准的非推理点对点重排序器(StandardRR),发现StandardRR通常优于ReasonRR。在此基础上,我们进一步研究了ReasonRR中推理的重要性,并通过禁用其推理过程(ReasonRR-NoReason)来观察效果,发现令人惊讶的是,ReasonRR-NoReason比ReasonRR更有效。对此结果进行深入分析后,我们的研究发现揭示出基于推理的重排序器受限于LLM的推理过程,这导致它们倾向于产生极端的相关性评分,从而忽略了段落的部分相关性,而这是点对点重排序精度的关键因素。
https://arxiv.org/abs/2505.16886
Uncertainty quantification in Knowledge Graph Embedding (KGE) methods is crucial for ensuring the reliability of downstream applications. A recent work applies conformal prediction to KGE methods, providing uncertainty estimates by generating a set of answers that is guaranteed to include the true answer with a predefined confidence level. However, existing methods provide probabilistic guarantees averaged over a reference set of queries and answers (marginal coverage guarantee). In high-stakes applications such as medical diagnosis, a stronger guarantee is often required: the predicted sets must provide consistent coverage per query (conditional coverage guarantee). We propose CondKGCP, a novel method that approximates predicate-conditional coverage guarantees while maintaining compact prediction sets. CondKGCP merges predicates with similar vector representations and augments calibration with rank information. We prove the theoretical guarantees and demonstrate empirical effectiveness of CondKGCP by comprehensive evaluations.
知识图谱嵌入(KGE)方法中的不确定性量化对于确保下游应用的可靠性至关重要。最近的一项工作将符合预测应用于KGE方法,通过生成一组包含真实答案且保证达到预定义置信水平的答案集合来提供不确定性估计。然而,现有的方法仅提供了基于查询和答案参考集上的概率保证(边际覆盖保证)。在高风险应用场景中,如医学诊断,通常需要更强的保证:预测集必须为每个单独的查询提供一致的覆盖率(条件覆盖保证)。 我们提出了一种名为CondKGCP的新方法,该方法可以近似谓词条件下的覆盖保证,并同时保持紧凑的预测集合。CondKGCP通过合并具有相似向量表示的谓词并利用排名信息进行校准来实现这一目标。我们证明了CondKGCP的理论保证并通过全面评估展示了其在实际应用中的有效性。
https://arxiv.org/abs/2505.16877
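The marginal coverage guarantee mentioned above is the one given by standard split conformal prediction. The sketch below shows that baseline only, not CondKGCP itself. It assumes each candidate answer has a nonconformity score (lower means more plausible), and the function names and numbers are illustrative.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of the true answers' nonconformity scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(calibration_scores)[k - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return {answer for answer, s in candidate_scores.items() if s <= threshold}

# Made-up nonconformity scores of the true answers on a calibration set.
calibration = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.90]
threshold = conformal_threshold(calibration, alpha=0.25)

# Made-up scores for the candidate answers of one test query.
candidates = {"aspirin": 0.15, "ibuprofen": 0.55, "insulin": 0.95}
print(prediction_set(candidates, threshold))
```

A naive route to predicate-conditional coverage would compute one threshold per predicate from that predicate's own calibration scores. Predicates with few calibration examples then get very large sets, which is presumably the problem that predicate merging is meant to ease.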
We cast nested named entity recognition (NNER) as a sequence labeling task by leveraging prior work that linearizes constituency structures, effectively reducing the complexity of this structured prediction problem to straightforward token classification. By combining these constituency linearizations with pretrained encoders, our method captures nested entities while performing exactly $n$ tagging actions. Our approach achieves competitive performance compared to less efficient systems, and it can be trained using any off-the-shelf sequence labeling library.
我们将嵌套命名实体识别(NNER)视为一个序列标注任务,通过利用先前将构成结构线性化的研究成果,这种方法有效地简化了这种结构化预测问题,并将其转化为简单的标记分类。结合这些构成线性化和预训练编码器,我们的方法在执行精确的$n$个标签操作的同时捕捉到了嵌套实体。与效率较低的系统相比,我们的方法取得了具有竞争力的表现,并且可以使用任何现成的序列标注库进行训练。
https://arxiv.org/abs/2505.16855
Reservoir Computing (RC) with physical systems requires an understanding of the underlying structure and internal dynamics of the specific physical reservoir. In this study, physical nano-electronic networks with neuromorphic dynamics are investigated for their use as physical reservoirs in an RC framework. These neuromorphic networks operate as dynamic reservoirs, with node activities in general coupled to the edge dynamics through nonlinear nano-electronic circuit elements, and the reservoir outputs influenced by the underlying network connectivity structure. This study finds that networks with varying degrees of sparsity generate more useful nonlinear temporal outputs for dynamic RC than dense networks do. Dynamic RC is also tested on an autonomous multivariate chaotic time series prediction task with networks of varying densities, which revealed the importance of network sparsity in maintaining network activity and overall dynamics, which in turn enabled the learning of the chaotic Lorenz63 system's attractor behavior.
基于物理系统的液态计算(Reservoir Computing,RC)要求理解特定物理蓄水池的底层结构和内部动态。在这项研究中,探讨了具有神经形态动力学特性的纳米电子网络在RC框架中的应用潜力。这些神经形态网络作为动态蓄水池运行,节点活动通常通过非线性纳米电子电路元件与边缘动力学耦合,并且蓄水池输出受底层网络连接结构的影响。 研究发现,稀疏程度不同的网络相比于密集型网络能够生成更多有用的非线性时间序列输出,这对于动态RC尤为重要。此外,还测试了在自主多变量混沌时间序列预测任务中不同密度的网络对于动态RC的作用,这揭示了网络稀疏度在网络活动和整体动力学维持中的重要性,并进一步使得复杂系统的混沌吸引子行为(如洛伦兹63系统)得以学习和理解。
https://arxiv.org/abs/2505.16813
The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and realtime decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.
将视觉-语言模型(VLMs)集成到自主驾驶系统中,展示了在解决学习复杂性、可解释性和常识推理等关键挑战方面的潜力。然而,现有的方法常常因计算需求而难以实现高效整合和实时决策。为此,本文介绍了SOLVE框架,它通过结合VLM与端到端(E2E)模型来增强自主车辆的规划能力。我们的方法强调在特征级别上通过共享视觉编码器进行知识分享,从而使VLM和E2E组件能够全面互动。 我们提出了一种轨迹链式思考(T-CoT)范例,该范例逐步细化轨迹预测,减少不确定性并提高准确性。SOLVE采用时间解耦策略实现高效合作,在确保高质量的VLM输出的同时,也能达到E2E模型的实时性能要求。在nuScenes数据集上进行评估后,我们的方法显著提高了轨迹预测的准确性,为构建更稳健和可靠的自主驾驶系统铺平了道路。
https://arxiv.org/abs/2505.16805
Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not all modalities are available for every sample, and acquiring additional modalities can be costly. This raises the question: which samples should be prioritized for additional modality acquisition when resources are limited? While prior work has explored individual-level acquisition strategies and training-time active learning paradigms, test-time and cohort-based acquisition remain underexplored despite their importance in many real-world settings. We introduce Cohort-based Active Modality Acquisition (CAMA), a novel test-time setting to formalize the challenge of selecting which samples should receive additional modalities. We derive acquisition strategies that leverage a combination of generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities based on common evaluation metrics. We also introduce upper-bound heuristics that provide performance ceilings to benchmark acquisition strategies. Experiments on common multimodal datasets demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of new samples in comparison to those relying solely on unimodal information, entropy guidance, and random selections. Our work provides an effective solution for optimizing modality acquisition at the cohort level, enabling better utilization of resources in constrained settings.
实际中的机器学习应用经常需要处理来自多种模态的数据,这些数据必须有效整合才能做出稳健的预测。然而,在许多现实场景中,并非每个样本都能获取所有模态的数据,而获得额外模态的数据可能成本高昂。这引出了一个问题:当资源有限时,应该优先为哪些样本增加额外的数据模态采集?尽管先前的工作已经探索了个体级别的数据采集策略和训练时间的主动学习范式,但在测试时间和基于群体(cohort-based)的数据采集方面研究仍然不足,尽管这些方法在许多现实场景中非常重要。我们引入了一种新的测试时间设置——Cohort-based Active Modality Acquisition (CAMA),用于正式化选择哪些样本应接受额外模态数据采集的挑战。 我们提出了一种结合生成式填补和判别模型的方法来估算获取缺失模态所带来的预期收益,基于常见的评估指标。我们也引入了上限启发式方法,为基准测试提供性能天花板。在常见多模态数据集上的实验表明,与仅依赖单模态信息、熵指导以及随机选择相比,我们提出的填补策略更有效地指导新样本的采集。 我们的工作提供了一种针对群体层面优化模态采集的有效解决方案,在资源受限的情况下能够更好地利用资源。
https://arxiv.org/abs/2505.16791
Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and introduce the surprising result that naively applying standard MDMs severely degrades the performance. We identify the critical cause of this issue as a state-clashing problem, where the forward diffusion of distinct molecules collapses into a common state, resulting in a mixture of reconstruction targets that cannot be learned using a typical reverse diffusion process with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD), which orchestrates per-element corruption trajectories to avoid collision between distinct molecular graphs. This is achieved through a parameterized noise scheduling network that assigns distinct corruption rates to individual graph elements, i.e., atoms and bonds. Extensive experiments on diverse molecular benchmarks reveal that MELD markedly enhances overall generation quality compared to element-agnostic noise scheduling, increasing the chemical validity of vanilla MDMs on ZINC250K from 15% to 93%. Furthermore, it achieves state-of-the-art property alignment in conditional generation tasks.
掩码扩散模型(MDMs)在离散数据建模方面取得了显著进展,然而其在分子生成中的潜力尚未被充分探索。在这项工作中,我们探讨了它们的潜力,并引入了一个令人惊讶的结果:直接应用标准的MDMs会严重降低性能。我们将这一问题的关键原因归结为“状态冲突”问题——即不同分子的前向扩散过程最终收敛到同一个状态,这导致了一种混合重构目标,无法通过典型的单模逆向扩散过程进行学习。为解决这个问题,我们提出了掩码元素级可学习扩散(MELD),该方法通过为每个图元(如原子和键)分配不同的污染率来协调各个元素的腐败轨迹,从而避免不同分子图之间的冲突。这一机制由一个参数化的噪声调度网络实现。 在多种分子基准测试中进行广泛实验后发现,与不区分元素的噪声调度相比,MELD显著提高了整体生成质量。具体而言,它将纯MDMs模型在ZINC250K数据集上的化学有效性从15%提升到了93%,同时还在条件生成任务中达到了最先进的属性对齐水平。
https://arxiv.org/abs/2505.16790
Model-based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. MBRL methods have progressed by largely prioritising the actor; optimising the world model learning has been neglected meanwhile. Improving the fidelity of the world model and reducing its time to convergence can yield significant downstream benefits, one of which is improving the ensuing performance of any actor it may train. We propose a novel approach that anticipates and actively seeks out high-entropy states using short-horizon latent predictions generated by the world model, offering a principled alternative to traditional curiosity-driven methods that chase once-novel states well after they were stumbled into. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multi step plans after every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, planning horizon length, and the weighting between reward and entropy. While our method can theoretically be applied to any model that trains its own actors with solely model generated data, we have applied it to just Dreamer as a proof of concept. Our method finishes the Miniworld procedurally generated mazes 50% faster than base Dreamer at convergence and the policy trained in imagination converges in only 60% of the environment steps that base Dreamer needs.
基于模型的强化学习(MBRL)提供了一种直观的方法,可以通过同时训练一个能够预测未来的世界模型来提高无模型RL方法的样本效率。虽然MBRL方法主要侧重于优化演员(actor),但忽略了对世界模型的学习优化。通过改进世界模型的真实度并减少其收敛时间,可以带来显著的好处之一就是提升它所训练出的任何代理(agent)的表现性能。我们提出了一种新颖的方法,该方法能够预测由世界模型生成的短时地平线隐式状态,并主动寻找高熵状态,这为传统的好奇心驱动方法提供了一个原理性的替代方案,后者会在很久之后才追逐那些偶然发现的新颖状态。尽管许多基于模型预测控制(MPC)的方法提供了类似的替代方案,但它们通常缺乏一致性,在每一步后都会合成多步计划。为了缓解这一问题,我们提出了一种分层规划者,它可以动态地决定何时重新规划、规划的地平线长度以及奖励和熵之间的权重分配。 虽然我们的方法理论上可以应用于任何能够仅通过模型生成的数据训练自己代理的模型上,但在本次研究中,我们仅将其应用到Dreamer作为概念验证。与基准Dreamer相比,使用这种方法完成Miniworld程序生成迷宫的速度提高了50%,并且在想象空间内进行的策略训练只需60%的基础Dreamer所需环境步骤数即可收敛。
https://arxiv.org/abs/2505.16787
We explore the use of conformal prediction to provide statistical uncertainty guarantees for runway detection in vision-based landing systems (VLS). Using fine-tuned YOLOv5 and YOLOv6 models on aerial imagery, we apply conformal prediction to quantify localization reliability under user-defined risk levels. We also introduce Conformal mean Average Precision (C-mAP), a novel metric aligning object detection performance with conformal guarantees. Our results show that conformal prediction can improve the reliability of runway detection by quantifying uncertainty in a statistically sound way, increasing on-board safety and paving the way for certification of ML systems in the aerospace domain.
我们探讨了使用符合预测(conformal prediction)为基于视觉着陆系统的跑道检测提供统计不确定性保证的方法。通过在航拍图像上对YOLOv5和YOLOv6模型进行微调,我们将符合预测应用于量化用户定义的风险水平下的定位可靠性。此外,我们还引入了一种新的度量标准——符合平均精度(Conformal mean Average Precision, C-mAP),该指标将对象检测性能与符合保证相结合。我们的研究结果表明,通过使用统计方法量化不确定性,符合预测可以提高跑道检测的可靠性,从而增强机上安全性,并为航空航天领域中机器学习系统的认证铺平道路。
https://arxiv.org/abs/2505.16740
Concept bottleneck models (CBMs) ensure interpretability by decomposing predictions into human interpretable concepts. Yet the annotations used for training CBMs that enable this transparency are often noisy, and the impact of such corruption is not well understood. In this study, we present the first systematic study of noise in CBMs and show that even moderate corruption simultaneously impairs prediction performance, interpretability, and the intervention effectiveness. Our analysis identifies a susceptible subset of concepts whose accuracy declines far more than the average gap between noisy and clean supervision and whose corruption accounts for most performance loss. To mitigate this vulnerability we propose a two-stage framework. During training, sharpness-aware minimization stabilizes the learning of noise-sensitive concepts. During inference, where clean labels are unavailable, we rank concepts by predictive entropy and correct only the most uncertain ones, using uncertainty as a proxy for susceptibility. Theoretical analysis and extensive ablations elucidate why sharpness-aware training confers robustness and why uncertainty reliably identifies susceptible concepts, providing a principled basis that preserves both interpretability and resilience in the presence of noise.
概念瓶颈模型(CBMs)通过将预测分解为人可解释的概念来确保可解释性。然而,用于训练这些模型的标注数据常常包含噪声,并且这种污染的影响尚未被充分理解。在这项研究中,我们首次系统地研究了CBM中的噪声问题,并表明即使是适度的腐败也会同时损害预测性能、可解释性和干预效果的有效性。我们的分析识别出了一组特别脆弱的概念,它们的准确性下降幅度远超过噪声和干净监督之间的平均差距,而且这种污染导致了大部分性能损失。 为了减轻这一弱点,我们提出了一种两阶段框架。在训练阶段,通过使用“尖锐度感知最小化”(sharpness-aware minimization)来稳定学习对噪声敏感的概念。而在推理阶段,当没有干净的标签时,我们会根据预测熵对概念进行排名,并仅修正最不确定的概念,以此作为脆弱性的代理指标。 理论分析和广泛的消融实验阐明了为什么“尖锐度感知训练”可以提供鲁棒性,以及为什么不确定性能够可靠地识别出脆弱的概念。这为在存在噪声的情况下同时保持可解释性和鲁棒性提供了原则基础。
https://arxiv.org/abs/2505.16705
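The inference-time step described above, ranking concepts by predictive entropy and correcting only the most uncertain ones, can be sketched as follows. This assumes binary concepts with a predicted probability each. The concept names, probabilities, and function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def most_uncertain_concepts(concept_probs, k):
    """Return the k concept names with the highest predictive entropy."""
    ranked = sorted(
        concept_probs,
        key=lambda name: binary_entropy(concept_probs[name]),
        reverse=True,
    )
    return ranked[:k]

# Made-up concept predictions for a single input.
concept_probs = {
    "wing_black": 0.97,
    "beak_long": 0.52,
    "tail_striped": 0.30,
    "eye_red": 0.05,
}

# Concepts selected for correction under a budget of two interventions.
print(most_uncertain_concepts(concept_probs, k=2))
```

Predictions near 0.5 have the highest entropy, so they are selected first. Confident predictions near 0 or 1 are left untouched.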
Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the transformer's ICL ability.
基于Transformer的语言模型展示了情境学习(ICL)能力,即根据上下文自适应地进行预测。以往的研究通过准确性突然跃升将归纳头部与ICL联系起来,但这仅能解释当答案包含在上下文中时的ICL现象。然而,在大型语言模型中实际使用的ICL的一个重要特性是能够从情境中学到如何解决任务,而不仅仅是从上下文中复制答案;这种能力是如何在训练过程中获得的仍然很大程度上未被探索。 在这篇论文中,我们通过分析模型电路在训练过程中的动态变化来实验性地阐明了这种元学习能力是如何获取的。具体来说,我们将先前研究中的复制任务扩展到了情境元学习设置,在这种设置下,模型必须从示例推断出任务以回答查询。有趣的是,在这个设置中,我们发现获得这种能力的过程经历了多个阶段,并且在每个阶段都会出现独特的电路结构,这与归纳头部单一阶段的变化不同。这些电路的形成可以关联到大型语言模型中已知的一些现象,我们的分析为理解Transformer的ICL能力来源提供了更深入的理解。
https://arxiv.org/abs/2505.16694
Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates PoLM's prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g. GPT-4o) by up to 15.08$\%$ on common benchmarks.
大型语言模型的后期训练对于将预训练的语言模型(PLM)与人类偏好和下游任务对齐至关重要。尽管预训练的语言模型通常表现出良好的置信度校准,但经过后期训练的语言模型(PoLMs)常常会出现过度自信的问题,即在正确输出和错误输出上都赋予了过高的置信度,这可能会影响其在关键应用中的可靠性。校准PoLM的一个主要障碍是为特定的下游任务获取标注数据极为困难。 为了应对这一挑战,我们提出了一种新的无监督方法——不一致感知置信对齐(DACA),用于优化后期自信校准过程中的参数(如温度$\tau$)。我们的方法基于这样一个动机:当PLM和PoLM在预测中出现分歧时,后者会出现低估自身准确性的现象。理论上,在通过调整温度来校准时,这种低估会导致较大的$\tau$值,并产生过度保守的预测。 DACA通过仅使用一致样本进行校准来缓解这一问题,从而有效地解耦了不一致对校准的影响。这种方法避免了在温度缩放过程中由于不一致样本导致的过大的$\tau$值,从而提高了整体的校准性能。 广泛的实验结果证明了我们方法的有效性,在常见的基准测试中将开源和API基础的大规模语言模型(如GPT-4o)的平均ECE改进高达15.08%。
https://arxiv.org/abs/2505.16690
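The effect DACA targets can be illustrated with a toy temperature search. The sketch assumes the calibration objective is the squared gap between the PoLM's temperature-scaled confidence and the PLM's confidence, minimized over a small grid of $\tau$ values. That objective, the grid, and all numbers are assumptions for illustration, not the paper's procedure.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def fit_temperature(plm_probs, polm_logits, taus, agreement_only=True):
    """Grid-search tau so that PoLM confidence matches PLM confidence.

    With agreement_only=True, examples where the two models predict
    different classes are dropped before fitting.
    """
    pairs = [
        (p, z)
        for p, z in zip(plm_probs, polm_logits)
        if not agreement_only or argmax(p) == argmax(z)
    ]
    if not pairs:
        raise ValueError("no examples left to calibrate on")

    def confidence_gap(tau):
        return sum((max(softmax(z, tau)) - max(p)) ** 2 for p, z in pairs) / len(pairs)

    return min(taus, key=confidence_gap)

# Made-up data: the models agree on the first two examples and disagree on the third.
plm_probs = [[0.88, 0.12], [0.82, 0.18], [0.55, 0.45]]
polm_logits = [[4.0, 0.0], [3.0, 0.0], [0.0, 5.0]]
taus = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]

tau_agreement = fit_temperature(plm_probs, polm_logits, taus)
tau_all = fit_temperature(plm_probs, polm_logits, taus, agreement_only=False)
print(tau_agreement, tau_all)
```

In this toy setup the single disagreement example pulls the fitted $\tau$ upward, which flattens the PoLM's confidence on every example. Fitting on agreement examples alone avoids that.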
Accurate prediction of the Remaining Useful Life (RUL) is essential for enabling timely maintenance of lithium-ion batteries, impacting the operational efficiency of electric applications that rely on them. This paper proposes a RUL prediction approach that leverages data from recent charge-discharge cycles to estimate the number of remaining usable cycles. The approach introduces both a novel signal processing pipeline and a deep learning prediction model. In the signal preprocessing pipeline, a derived capacity feature is computed based on current and capacity signals. Alongside original capacity, voltage and current, these features are denoised and enhanced using statistical metrics and a delta-based method to capture differences between the current and previous cycles. In the prediction model, the processed features are then fed into a hybrid deep learning architecture composed of 1D Convolutional Neural Networks (CNN), Attentional Long Short-Term Memory (A-LSTM), and Ordinary Differential Equation-based LSTM (ODE-LSTM) modules. This architecture is designed to capture both local signal characteristics and long-range temporal dependencies while modeling the continuous-time dynamics of battery degradation. The model is further evaluated using transfer learning across different learning strategies and target data partitioning scenarios. Results indicate that the model maintains robust performance, even when fine-tuned on limited target data. Experimental results on two publicly available large-scale datasets demonstrate that the proposed method outperforms a baseline deep learning approach and machine learning techniques, achieving an RMSE of 101.59, highlighting its strong potential for real-world RUL prediction applications.
准确预测剩余使用寿命(RUL)对于及时维护锂离子电池至关重要,这会影响依赖这些电池的电动应用的操作效率。本文提出了一种基于最近充放电循环数据来估算剩余可用循环数的RUL预测方法。该方法引入了一个新颖的信号处理管道和一个深度学习预测模型。 在信号预处理管道中,根据电流和容量信号计算出衍生容量特征。与原始容量、电压和电流一起,这些特征通过统计指标和基于增量的方法进行去噪和增强,以捕捉当前循环与前一循环之间的差异。 在预测模型中,经过处理的特征被输入到一个混合深度学习架构中,该架构由1D卷积神经网络(CNN)、注意力长短期记忆(A-LSTM)模块以及基于常微分方程的LSTM(ODE-LSTM)模块组成。这种架构设计旨在捕获局部信号特征和长期时间依赖关系,并建模电池退化过程中的连续时间动态。 该模型通过不同的学习策略和目标数据分区场景进行迁移学习进一步进行了评估,结果表明即使在有限的目标数据上微调的情况下也能保持稳健的性能。 实验结果基于两个公开的大规模数据集,证明了所提出的方法优于深度学习基线方法和机器学习技术,在RMSE(均方根误差)方面取得了101.59的成绩,突显其在实际RUL预测应用中的强大潜力。
https://arxiv.org/abs/2505.16664
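Two ingredients named in the abstract above are simple enough to sketch: a delta feature that records the change from the previous cycle to the current one, and the RMSE used to report results. The exact feature definitions in the paper are not given here, so the functions and numbers below are illustrative assumptions.

```python
import math

def cycle_deltas(values):
    """Difference between each cycle's value and the previous cycle's value."""
    return [curr - prev for prev, curr in zip(values, values[1:])]

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up discharge capacities (Ah) over five consecutive cycles.
capacity = [1.10, 1.09, 1.07, 1.06, 1.02]
print(cycle_deltas(capacity))

# Made-up remaining-cycle predictions against ground truth.
print(rmse([480, 455, 430], [500, 450, 400]))
```

Because RMSE is expressed in the units of the target, an RMSE on RUL is a number of cycles.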
Empowered by a vast internal knowledge reservoir, the new generation of large language models (LLMs) demonstrates untapped potential to tackle medical tasks. However, insufficient effort has been made towards eliciting a synergistic effect from the expertise and background of multiple LLMs. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice question dataset. Through post-hoc analysis of 3 pre-trained LLM participants, our framework is shown to boost the reasoning ability of all LLMs and to reduce their divergence across questions. We also measure an LLM's confidence when it is confronted with opposing opinions from other LLMs and observe a concurrence between an LLM's confidence and its prediction accuracy.
依托于庞大的内部知识库,新一代的大规模语言模型(LLM)展示了处理医疗任务的潜力。然而,在调动多个LLM的专业知识和背景以形成协同效应方面,目前的努力尚显不足。本研究提出了一种基于医学多选题数据集的多LLM协作框架。通过对三个预训练的LLM进行事后分析,我们的框架被证明可以增强所有LLM的推理能力,并缓解它们在不同问题上的差异。此外,我们还测量了当一个LLM面对其他LLM提出的对抗性意见时的信心水平,并观察到了LLM信心与其预测准确度之间的契合关系。
https://arxiv.org/abs/2505.16648