Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization across a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision and imaging tasks. Pretraining strategies such as masked image reconstruction or contrastive learning shape representations toward recovering generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as the maximum mean discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
https://arxiv.org/abs/2601.15888
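The alignment metric named in the abstract above, maximum mean discrepancy, can be estimated from two samples of feature vectors with a kernel two-sample statistic. Below is a minimal, illustrative sketch of a biased RBF-kernel MMD² estimate in plain Python; the function names, toy feature sets, and choice of `gamma` are assumptions for illustration, not the paper's implementation.

```python
import math

def rbf(u, v, gamma=1.0):
    # Gaussian (RBF) kernel between two feature vectors.
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    """Biased two-sample MMD^2 estimate between feature sets X and Y."""
    kxx = sum(rbf(x, xp, gamma) for x in X for xp in X) / len(X) ** 2
    kyy = sum(rbf(y, yp, gamma) for y in Y for yp in Y) / len(Y) ** 2
    kxy = sum(rbf(x, y, gamma) for x in X for y in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy

# Toy stand-ins for "the same features before and after fine-tuning".
pre = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
post_unchanged = [list(v) for v in pre]              # fine-tuning moved nothing
post_shifted = [[a + 2.0 for a in v] for v in pre]   # fine-tuning moved features far
```

A value near zero indicates the pre- and post-fine-tuning features occupy similar distributions, which the abstract links to smaller task mismatch.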
We present a scalable framework for cross-embodiment humanoid robot control by learning a shared latent representation that unifies motion across humans and diverse humanoid platforms, including single-arm, dual-arm, and legged humanoid robots. Our method proceeds in two stages: first, we construct a decoupled latent space that captures localized motion patterns across different body parts using contrastive learning, enabling accurate and flexible motion retargeting even across robots with diverse morphologies. To enhance alignment between embodiments, we introduce tailored similarity metrics that combine joint rotation and end-effector positioning for critical segments, such as arms. Then, we train a goal-conditioned control policy directly within this latent space using only human data. Leveraging a conditional variational autoencoder, our policy learns to predict latent space displacements guided by intended goal directions. We show that the trained policy can be directly deployed on multiple robots without any adaptation. Furthermore, our method supports the efficient addition of new robots to the latent space by learning only a lightweight, robot-specific embedding layer. The learned latent policies can also be directly applied to the new robots. Experimental results demonstrate that our approach enables robust, scalable, and embodiment-agnostic robot control across a wide range of humanoid platforms.
https://arxiv.org/abs/2601.15419
This paper presents OpenVision 3, a family of advanced vision encoders that learn a single, unified visual representation serving both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably with a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
https://arxiv.org/abs/2601.15369
The requirement for expert annotations limits the effectiveness of deep learning for medical image analysis. Although 3D self-supervised methods like volume contrastive learning (VoCo) are powerful and partially address the scarcity of labels, their high computational cost and memory consumption are barriers. We propose 2D-VoCo, an efficient adaptation of the VoCo framework for slice-level self-supervised pre-training that learns spatial-semantic features from unlabeled 2D CT slices via contrastive learning. The pre-trained CNN backbone is then integrated into a CNN-LSTM architecture to classify multi-organ injuries. On the RSNA 2023 Abdominal Trauma dataset, 2D-VoCo pre-training significantly improves mAP, precision, recall, and RSNA score over training from scratch. Our framework provides a practical method to reduce the dependency on labeled data and enhance model performance in clinical CT analysis. We release the code for reproducibility. this https URL
https://arxiv.org/abs/2601.14593
Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at this https URL.
https://arxiv.org/abs/2601.14256
The quality of data augmentation is a critical determinant of the performance of contrastive learning in EEG tasks. Although this paradigm is promising for utilizing unlabeled data, static or random augmentation strategies often fail to preserve intrinsic information due to the non-stationarity of EEG signals, whose statistical properties change over time. To address this, we propose RL-BioAug, a framework that leverages a label-efficient reinforcement learning (RL) agent to autonomously determine optimal augmentation policies. While utilizing only a minimal fraction (10%) of labeled data to guide the agent's policy, our method enables the encoder to learn robust representations in a strictly self-supervised manner. Experimental results demonstrate that RL-BioAug significantly outperforms the random selection strategy, achieving substantial improvements of 9.69% and 8.80% in Macro-F1 score on the Sleep-EDFX and CHB-MIT datasets, respectively. Notably, the agent converged on task-specific strategies: for example, Time Masking with a 62% probability for sleep stage classification and Crop & Resize with a 77% probability for seizure detection. Our framework suggests its potential to replace conventional heuristic-based augmentations and establish a new autonomous paradigm for data augmentation. The source code is available at this https URL.
https://arxiv.org/abs/2601.13964
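The two augmentations the agent reportedly favored, Time Masking and Crop & Resize, can be sketched for a 1-D signal as below. This is an illustrative reimplementation under assumed parameter names (`mask_frac`, `crop_frac`); the paper's exact operators may differ.

```python
import random

def time_mask(signal, mask_frac=0.2, rng=None):
    """Zero out a contiguous fraction of a 1-D signal."""
    rng = rng or random.Random(0)
    n = len(signal)
    width = max(1, int(n * mask_frac))
    start = rng.randrange(0, n - width + 1)
    out = list(signal)
    out[start:start + width] = [0.0] * width
    return out

def crop_resize(signal, crop_frac=0.5, rng=None):
    """Crop a random window, then linearly resample it back to the original length."""
    rng = rng or random.Random(0)
    n = len(signal)
    width = max(2, int(n * crop_frac))
    start = rng.randrange(0, n - width + 1)
    crop = signal[start:start + width]
    out = []
    for i in range(n):
        pos = i * (width - 1) / (n - 1)   # fractional index into the crop
        lo = int(pos)
        hi = min(lo + 1, width - 1)
        frac = pos - lo
        out.append(crop[lo] * (1 - frac) + crop[hi] * frac)
    return out

sig = [float(i) for i in range(10)]
masked = time_mask(sig)
resized = crop_resize(sig)
```

Both operators preserve the signal length, so augmented views remain valid encoder inputs.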
Knowledge Graphs (KGs) often suffer from unreliable knowledge, which restricts their utility. Triple Classification (TC) aims to determine the validity of triples from KGs. Recently, text-based methods that learn entity and relation representations from natural language descriptions have significantly improved the generalization capabilities of TC models and set new performance benchmarks. However, two critical challenges remain. First, existing methods often ignore the effective semantic interaction among different KG components. Second, most approaches adopt a single binary-classification training objective, leading to insufficient semantic representation learning. To address these challenges, we propose SASA, a novel framework designed to enhance TC models via a separated attention mechanism and semantic-aware contrastive learning (CL). Specifically, we first propose a separated attention mechanism to encode triples into decoupled contextual representations and then fuse them in a more effective, interactive way. Then, we introduce semantic-aware hierarchical CL as an auxiliary training objective, spanning both local-level and global-level CL, to guide models toward stronger discriminative capability and sufficient semantic learning. Experimental results across two benchmark datasets demonstrate that SASA significantly outperforms state-of-the-art methods. In terms of accuracy, we advance the state of the art by +5.9% on FB15k-237 and +3.4% on YAGO3-10.
https://arxiv.org/abs/2601.13035
Surgical action triplet recognition aims to understand fine-grained surgical behaviors by modeling the interactions among instruments, actions, and anatomical targets. Despite its clinical importance for workflow analysis and skill assessment, progress has been hindered by severe class imbalance, subtle visual variations, and the semantic interdependence among triplet components. Existing approaches often address only a subset of these challenges rather than tackling them jointly, which limits their ability to form a holistic understanding. This study builds upon CurConMix, a spatial representation framework. At its core, a curriculum-guided contrastive learning strategy learns discriminative and progressively correlated features, further enhanced by structured hard-pair sampling and feature-level mixup. Its temporal extension, CurConMix+, integrates a Multi-Resolution Temporal Transformer (MRTT) that achieves robust, context-aware understanding by adaptively fusing multi-scale temporal features and dynamically balancing spatio-temporal cues. Furthermore, we introduce LLS48, a new, hierarchically annotated benchmark for complex laparoscopic left lateral sectionectomy, providing step-, task-, and action-level annotations. Extensive experiments on CholecT45 and LLS48 demonstrate that CurConMix+ not only outperforms state-of-the-art approaches in triplet recognition, but also exhibits strong cross-level generalization, as its fine-grained features effectively transfer to higher-level phase and step recognition tasks. Together, the framework and dataset provide a unified foundation for hierarchy-aware, reproducible, and interpretable surgical workflow understanding. The code and dataset will be publicly released on GitHub to facilitate reproducibility and further research.
https://arxiv.org/abs/2601.12312
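Feature-level mixup, one ingredient of the CurConMix training strategy described above, is in its generic form a convex combination of two feature vectors. A minimal sketch, assuming the standard mixup formulation rather than the paper's exact scheme (the names `feature_mixup` and `lam` are illustrative):

```python
import random

def feature_mixup(feat_a, feat_b, lam=None, rng=None):
    """Feature-level mixup: convex combination of two feature vectors.
    A generic sketch; the paper's exact recombination may differ."""
    rng = rng or random.Random(0)
    if lam is None:
        lam = rng.random()  # mixing coefficient in [0, 1)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    return mixed, lam

mixed, lam = feature_mixup([1.0, 0.0], [0.0, 1.0], lam=0.3)
```

In contrastive training, such mixed features are often used as harder positives or negatives, which matches the abstract's pairing of mixup with structured hard-pair sampling.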
Digital pathology plays a vital role across modern medicine, offering critical insights for disease diagnosis, prognosis, and treatment. However, histopathology images often contain artifacts introduced during slide preparation and digitization. Detecting and excluding them is essential to ensure reliable downstream analysis. Traditional supervised models typically require large annotated datasets, which is resource-intensive and does not generalize to novel artifact types. To address this, we propose DiffusionQC, which detects artifacts as outliers among clean images using a diffusion model. It requires only a set of clean images for training rather than pixel-level artifact annotations and predefined artifact types. Furthermore, we introduce a contrastive learning module to explicitly enlarge the distribution separation between artifact and clean images, yielding an enhanced version of our method. Empirical results demonstrate performance superior to the state of the art and cross-stain generalization, with significantly less data and annotation.
https://arxiv.org/abs/2601.12233
The widespread proliferation of online content has intensified concerns about clickbait, deceptive or exaggerated headlines designed to attract attention. While Large Language Models (LLMs) offer a promising avenue for addressing this issue, their effectiveness is often hindered by Sycophancy, a tendency to produce reasoning that matches users' beliefs over truthful ones, which deviates from instruction-following principles. Rather than treating sycophancy as a flaw to be eliminated, this work proposes a novel approach that initially harnesses this behavior to generate contrastive reasoning from opposing perspectives. Specifically, we design a Self-renewal Opposing-stance Reasoning Generation (SORG) framework that prompts LLMs to produce high-quality agree and disagree reasoning pairs for a given news title without requiring ground-truth labels. To utilize the generated reasoning, we develop a local Opposing Reasoning-based Clickbait Detection (ORCD) model that integrates three BERT encoders to represent the title and its associated reasoning. The model leverages contrastive learning, guided by soft labels derived from LLM-generated credibility scores, to enhance detection robustness. Experimental evaluations on three benchmark datasets demonstrate that our method consistently outperforms LLM prompting, fine-tuned smaller language models, and state-of-the-art clickbait detection baselines.
https://arxiv.org/abs/2601.12019
The safety validation of autonomous robotic vehicles hinges on systematically testing their planning and control stacks against rare, safety-critical scenarios. Mining these long-tail events from massive real-world driving logs is therefore a critical step in the robotic development lifecycle. The goal of the Scenario Mining task is to retrieve useful information to enable targeted re-simulation, regression testing, and failure analysis of the robot's decision-making algorithms. RefAV, introduced by the Argoverse team, is an end-to-end framework that uses large language models (LLMs) to spatially and temporally localize scenarios described in natural language. However, this process performs retrieval on trajectory labels, ignoring the direct connection between natural language and raw RGB images, which runs counter to the intuition of video retrieval; it also depends on the quality of upstream 3D object detection and tracking. Further, inaccuracies in trajectory data lead to inaccuracies in downstream spatial and temporal localization. To address these issues, we propose Robust Scenario Mining for Robotic Autonomy from Coarse to Fine (SMc2f), a coarse-to-fine pipeline that (i) employs vision-language models (VLMs) for coarse image-text filtering; (ii) builds a database of successful mining cases on top of RefAV and automatically retrieves exemplars to few-shot condition the LLM for more robust retrieval; and (iii) introduces text-trajectory contrastive learning that pulls matched pairs together and pushes mismatched pairs apart in a shared embedding space, yielding a fine-grained matcher that refines the LLM's candidate trajectories. Experiments on public datasets demonstrate substantial gains in both retrieval quality and efficiency.
https://arxiv.org/abs/2601.12010
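"Pulling matched pairs together and pushing mismatched pairs apart" in a shared embedding space is commonly implemented with an InfoNCE-style objective. The sketch below is a generic formulation over a text-trajectory similarity matrix whose diagonal holds the matched pairs; the temperature value and toy matrices are illustrative assumptions, not taken from the paper.

```python
import math

def info_nce(sim_matrix, temperature=0.1):
    """InfoNCE over a square similarity matrix; entry [i][i] is the matched pair.
    Returns the mean cross-entropy of picking the matched column per row."""
    n = len(sim_matrix)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim_matrix[i]]
        m = max(logits)  # subtract max for numerical stability
        denom = sum(math.exp(l - m) for l in logits)
        loss += -(logits[i] - m - math.log(denom))
    return loss / n

# Matched pairs far more similar than mismatched ones -> near-zero loss.
aligned = [[1.0, 0.0], [0.0, 1.0]]
# All similarities equal -> loss sits at the uniform level, log(n).
uniform = [[0.5, 0.5], [0.5, 0.5]]
```

Minimizing this loss is exactly the pull-together/push-apart behavior the abstract describes for its fine-grained matcher.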
This paper presents a semantic course recommendation system for students using a self-supervised contrastive learning approach built upon BERT (Bidirectional Encoder Representations from Transformers). Traditional BERT embeddings suffer from anisotropic representation spaces, where course descriptions exhibit high cosine similarities regardless of semantic relevance. To address this limitation, we propose a contrastive learning framework with data augmentation and isotropy regularization that produces more discriminative embeddings. Our system processes student text queries and recommends Top-N relevant courses from a curated dataset of over 500 engineering courses across multiple faculties. Experimental results demonstrate that our fine-tuned model achieves improved embedding separation and more accurate course recommendations compared to vanilla BERT baselines.
https://arxiv.org/abs/2601.11427
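The anisotropy problem described above can be probed with a simple statistic: the mean pairwise cosine similarity of a set of embeddings, which is high when all vectors crowd into a narrow cone regardless of semantic relevance. A minimal sketch of this common diagnostic (the probe is a standard tool, not the paper's method, and the toy vectors are illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pairwise_cosine(embs):
    """Average cosine similarity over distinct pairs: a simple anisotropy probe."""
    pairs = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return sum(cosine(embs[i], embs[j]) for i, j in pairs) / len(pairs)

# Embeddings bunched in a narrow cone look anisotropic (mean cosine near 1)...
aniso = [[1.0, 0.05], [1.0, -0.05], [1.0, 0.0]]
# ...while well-spread embeddings have a much lower mean cosine.
iso = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
```

A contrastive fine-tuning objective like the one proposed should drive this statistic down on course-description embeddings.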
Multimedia recommendation systems leverage user-item interactions and multimodal information to capture user preferences, enabling more accurate and personalized recommendations. Despite notable advancements, existing approaches still face two critical limitations: first, shallow modality fusion often relies on simple concatenation, failing to exploit rich synergistic intra- and inter-modal relationships; second, asymmetric feature treatment, where users are characterized only by interaction IDs while items benefit from rich multimodal content, hinders the learning of a shared semantic space. To address these issues, we propose a Cross-modal Recursive Attention Network with dual graph Embedding (CRANE). To tackle shallow fusion, we design a core Recursive Cross-Modal Attention (RCA) mechanism that iteratively refines modality features based on cross-correlations in a joint latent space, effectively capturing high-order intra- and inter-modal dependencies. For symmetric multimodal learning, we explicitly construct users' multimodal profiles by aggregating the features of their interacted items. Furthermore, CRANE integrates a symmetric dual-graph framework, comprising a heterogeneous user-item interaction graph and a homogeneous item-item semantic graph, unified by a self-supervised contrastive learning objective to fuse behavioral and semantic signals. Despite these complex modeling capabilities, CRANE maintains high computational efficiency. Theoretical and empirical analyses confirm its scalability and high practical efficiency, achieving faster convergence on small datasets and superior performance ceilings on large-scale ones. Comprehensive experiments on four public real-world datasets validate an average 5% improvement in key metrics over state-of-the-art baselines.
https://arxiv.org/abs/2601.11151
Large Language Models (LLMs) adapted via contrastive learning excel in general representation learning but struggle in vertical domains like chemistry and law, primarily due to a lack of domain-specific knowledge. This work identifies a core bottleneck: the prevailing "LLM+CL" paradigm focuses on semantic alignment but cannot perform knowledge acquisition, leading to failures on specialized terminology. To bridge this gap, we propose Learn Before Represent (LBR), a novel two-stage framework. LBR first injects domain knowledge via an Information Bottleneck-Constrained Generative Learning stage, preserving the LLM's causal attention to maximize knowledge acquisition while compressing semantics. It then performs Generative-Refined Contrastive Learning on the compressed representations for alignment. This approach maintains architectural consistency and resolves the objective conflict between generative and contrastive learning. Extensive experiments on medical, chemistry, and code retrieval tasks show that LBR significantly outperforms strong baselines. Our work establishes a new paradigm for building accurate and robust representations in vertical domains.
https://arxiv.org/abs/2601.11124
Group anomaly detection is crucial in many network applications but faces challenges due to diverse anomaly patterns. Motivated by the success of large language models (LLMs) in natural language processing, graph foundation models (GFMs) have been proposed to handle few-shot learning tasks with less labeling effort. GFMs have been successfully applied to the detection of individual anomalies but do not generalize to group anomalies, as group anomaly patterns must be detected as a whole and individuals in an abnormal group can look rather normal. Therefore, we propose GFM4GA, a novel graph foundation model for group anomaly detection. The pipeline is pretrained via dual-level contrastive learning based on feature-based estimation and group extraction, to capture potential group anomaly structure and feature inconsistencies. In downstream tasks, the pipeline is fine-tuned in parameter-constrained, group-anomaly-proportion-weighted few-shot settings, and its ability to adapt to unseen group anomalies is expanded via group contexts determined by labeled anomalous neighbors. Experiments show that GFM4GA surpasses both group anomaly detectors and GFMs designed for individual anomalies, achieving average improvements of 2.85% in AUROC and 2.55% in AUPRC.
https://arxiv.org/abs/2601.10193
The effectiveness of contrastive learning has been widely recognized in graph learning, especially where graph data often lack labels or are difficult to label. However, applying these methods to node classification still faces several challenges. First, existing data augmentation techniques may generate new views that differ significantly from the original view, which can weaken the relevance between views and reduce the efficiency of model training. Second, the vast majority of existing graph contrastive learning algorithms rely on large numbers of negative samples. To address these challenges, this study proposes a novel contrastive learning method for node classification called Simple Network Graph Comparative Learning (SNGCL). Specifically, SNGCL employs stacked multilayer Laplacian smoothing filters as a preprocessing step to obtain global and local feature-smoothing matrices, which are passed into the target and online networks of a Siamese network; finally, it employs an improved triple recombination loss function to pull intra-class distances closer and push inter-class distances farther apart. We have compared SNGCL with state-of-the-art models on node classification tasks, and the experimental results show that SNGCL is strongly competitive in most tasks.
https://arxiv.org/abs/2601.10150
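The stated goal of pulling intra-class distances closer while pushing inter-class distances farther apart is, in its standard form, what a triplet margin loss does. A minimal sketch of that standard formulation (the paper's improved triple recombination loss is a variant of this idea; its exact form is not reproduced here):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on squared Euclidean distances:
    penalize when the anchor is not closer to the positive than to the
    negative by at least the margin. A generic form, not the paper's."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return max(0.0, d2(anchor, positive) - d2(anchor, negative) + margin)

# Anchor already far closer to the positive than the negative -> zero loss.
a, p, n = [0.0, 0.0], [0.1, 0.0], [3.0, 0.0]
easy = triplet_loss(a, p, n)
# Swapping positive and negative violates the margin -> positive loss.
hard = triplet_loss(a, n, p)
```

Minimizing this over many triplets shrinks intra-class distances (anchor-positive) and grows inter-class distances (anchor-negative), matching the behavior the abstract describes.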
Machine learning has achieved state-of-the-art results in network intrusion detection; however, its performance degrades significantly when confronted with a new attack class, i.e., a zero-day attack. In simple terms, classical machine learning-based approaches are adept at identifying attack classes on which they have been previously trained, but struggle with those not included in their training data. One approach to addressing this shortcoming is to use anomaly detectors that train exclusively on benign data with the goal of generalizing to all attack classes, both known and zero-day. However, this comes at the expense of a prohibitively high false positive rate. This work proposes a novel contrastive loss function that maintains the advantages of other contrastive-learning-based approaches (robustness to imbalanced data) while also generalizing to zero-day attacks. Unlike anomaly detectors, this model learns the distributions of benign traffic using both benign and known malign samples, i.e., other well-known attack classes (excluding the zero-day class), and consequently achieves significant performance improvements. The proposed approach is experimentally verified on the Lycos2017 dataset, where it achieves AUROC improvements of 0.000065 and 0.060883 over previous models in known and zero-day attack detection, respectively. Finally, the proposed method is extended to open-set recognition, achieving an OpenAUC improvement of 0.170883 over existing approaches.
https://arxiv.org/abs/2601.09902
The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, English evaluations performed better than their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.
https://arxiv.org/abs/2601.09367
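The abstract does not specify RAR's scoring rule; a minimal sketch of relation-aware demonstration selection, assuming a simple weighted blend of sentence-level and relation-level cosine similarity (the weight `alpha` and the toy 2-D embeddings are assumptions, not the paper's design):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rar_select(query_sent, query_rel, cand_sents, cand_rels, k=2, alpha=0.5):
    """Rank candidate demonstrations by a weighted blend of sentence-level
    and relation-level embedding similarity; return the top-k indices."""
    scores = [
        alpha * cosine(query_sent, s) + (1 - alpha) * cosine(query_rel, r)
        for s, r in zip(cand_sents, cand_rels)
    ]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

# Toy embeddings: candidate 0 matches both views, candidate 2 matches the
# relation view only, candidate 1 matches neither.
q_sent, q_rel = np.array([1.0, 0.0]), np.array([0.0, 1.0])
sents = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
rels = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
picked = rar_select(q_sent, q_rel, sents, rels, k=2)
print(picked)  # candidate 0 ranks first, candidate 2 second
```

The selected examples would then be inserted into the ICL prompt in ranked order; scoring on relation-level embeddings as well as sentence-level ones is what distinguishes this from plain sentence-similarity retrieval.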
In this paper, we propose a novel multimodal framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities like Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established multimodal visual-only methods on two benchmark datasets, demonstrating the significant benefit of language supervision. Codes are available at this https URL.
https://arxiv.org/abs/2601.08420
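The bi-directional contrastive objective the abstract refers to follows CLIP's symmetric InfoNCE formulation; a minimal NumPy sketch (batch size, temperature, and random embeddings are illustrative, and the real model would feed HSI/LiDAR encoder outputs and text embeddings here):

```python
import numpy as np

def clip_style_loss(img, txt, temp=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings: row i of `img`
    and row i of `txt` are positives; every other row is a negative."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temp  # (N, N) scaled cosine similarities

    def xent_diag(l):  # cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the image->text and text->image directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = clip_style_loss(emb, emb)        # perfectly paired batch
shuffled = clip_style_loss(emb, emb[::-1])  # pairings broken
print(aligned, shuffled)  # aligned loss is far smaller
```

Minimizing this loss pulls each modality's feature toward its matching text embedding in the shared latent space while pushing it away from the other captions in the batch.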
Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, there can be substantial similarities between images or reports from different patients. Rigidly treating all unpaired samples as negatives can disrupt the underlying semantic structure and negatively impact the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA), by exploiting the semantic correspondence between medical images and radiology reports at two levels, i.e., the image-report and patch-word levels. Specifically, we improve the conventional contrastive learning by incorporating inter-report similarity to eliminate the false negatives and introduce a method to effectively align image patches with relevant word tokens. Experimental results demonstrate the effectiveness of the proposed framework in improving transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. Notably, our framework achieves significant improvements in fine-grained tasks even with limited labeled data. Codes and pre-trained models will be made available.
https://arxiv.org/abs/2601.08165
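One plausible reading of the false-negative elimination described above is to drop highly similar unpaired reports from the negative set of an InfoNCE loss rather than pushing them apart; a NumPy sketch under that assumption (the threshold `tau`, the temperature, and the toy similarity matrices are hypothetical, not taken from the paper):

```python
import numpy as np

def masked_infonce(sim, report_sim, tau, temp=0.07):
    """Image->report InfoNCE where unpaired samples whose reports are
    nearly identical to the anchor's own report (similarity > tau) are
    removed from the negative set instead of treated as negatives."""
    n = sim.shape[0]
    logits = sim / temp
    false_neg = (report_sim > tau) & ~np.eye(n, dtype=bool)
    logits = np.where(false_neg, -np.inf, logits)  # drop false negatives
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

# Patients 0 and 1 have near-identical reports (similarity 0.95): with
# tau=0.9 each is dropped from the other's negatives; tau=1.1 keeps both.
sim = np.array([[0.9, 0.85, 0.1],
                [0.85, 0.9, 0.1],
                [0.1, 0.1, 0.9]])
rsim = np.array([[1.0, 0.95, 0.2],
                 [0.95, 1.0, 0.2],
                 [0.2, 0.2, 1.0]])
with_mask = masked_infonce(sim, rsim, tau=0.9)
no_mask = masked_infonce(sim, rsim, tau=1.1)
print(with_mask, no_mask)  # masking the false negatives lowers the loss
```

Masking stops the loss from penalizing the model for mapping semantically equivalent cases to nearby representations, which is exactly the semantic structure that rigid negative sampling would otherwise destroy.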