The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, English evaluations outperformed their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.
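The abstract does not spell out RAR's retrieval step, but the general shape of similarity-based in-context demonstration selection can be sketched as follows. The toy embeddings and `k` are illustrative assumptions; RAR itself additionally trains the embedding space contrastively on relation-level signals, which is not reproduced here.

```python
import numpy as np

def retrieve_demonstrations(query_vec, pool_vecs, k=2):
    """Return indices of the k pool examples most similar to the query.

    A generic nearest-neighbour ICL retriever; RAR additionally learns the
    encoder so that similarity reflects relation-level semantics.
    """
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity per pool example
    return np.argsort(-sims)[:k]      # best-first indices

# toy 3-D "sentence embeddings" for three candidate demonstrations
pool = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0])
print(retrieve_demonstrations(query, pool, k=2))  # → [0 1]
```

The retrieved examples would then be formatted into the prompt ahead of the test sentence.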
https://arxiv.org/abs/2601.09367
In this paper, we propose a novel multimodal framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities such as Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established multimodal visual-only methods on two benchmark datasets, demonstrating the significant benefit of language supervision. Code is available at this https URL.
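The bi-directional contrastive alignment borrowed from CLIP can be illustrated with a minimal numpy sketch of the symmetric InfoNCE objective. The temperature and the toy batch are illustrative; MMLGNet's actual encoders and handcrafted text embeddings are not reproduced.

```python
import numpy as np

def clip_style_loss(img, txt, temperature=0.07):
    """Symmetric (bi-directional) contrastive loss over a batch of paired
    image/text embeddings, in the style of CLIP: each image should match its
    own text against all others, and vice versa."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # pairwise similarities
    labels = np.arange(len(img))                # i-th image matches i-th text

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # image-to-text and text-to-image directions, averaged
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the loss approaches zero; shuffling the text side drives it up, which is the signal that pulls the two modalities into a shared space.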
https://arxiv.org/abs/2601.08420
Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, there can be substantial similarities between images or reports from different patients. Rigidly treating all unpaired samples as negatives can disrupt the underlying semantic structure and negatively impact the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA), which exploits the semantic correspondence between medical images and radiology reports at two levels, i.e., the image-report and patch-word levels. Specifically, we improve conventional contrastive learning by incorporating inter-report similarity to eliminate false negatives, and introduce a method to effectively align image patches with relevant word tokens. Experimental results demonstrate the effectiveness of the proposed framework in improving transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. Notably, our framework achieves significant improvements on fine-grained tasks even with limited labeled data. Code and pre-trained models will be made available.
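One simple way to realise the false-negative elimination described above is to drop unpaired samples whose reports are near-duplicates from the negative set of an InfoNCE loss. The threshold, temperature, and precomputed `report_sim` matrix are assumptions for this sketch; the paper's exact mechanism may differ (e.g., soft targets rather than hard masking).

```python
import numpy as np

def contrastive_loss_with_fn_mask(img, txt, report_sim, thresh=0.9, t=0.1):
    """InfoNCE where unpaired samples whose reports are highly similar
    (report_sim above `thresh`) are excluded from the negatives, so that
    semantically matching pairs from different patients are not pushed apart."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / t
    n = len(img)
    mask = (report_sim > thresh) & ~np.eye(n, dtype=bool)  # false negatives
    logits = np.where(mask, -np.inf, logits)  # remove them from the softmax
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(n), np.arange(n)]).mean()
```

Masking a near-duplicate negative can only shrink the softmax denominator, so the loss on the true pair never increases.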
https://arxiv.org/abs/2601.08165
Understanding narratives requires identifying which events are most salient for a story's progression. We present a contrastive learning framework for modeling narrative salience that learns story embeddings from narrative twins: stories that share the same plot but differ in surface form. Our model is trained to distinguish a story from both its narrative twin and a distractor with similar surface features but different plot. Using the resulting embeddings, we evaluate four narratologically motivated operations for inferring salience (deletion, shifting, disruption, and summarization). Experiments on short narratives from the ROCStories corpus and longer Wikipedia plot summaries show that contrastively learned story embeddings outperform a masked-language-model baseline, and that summarization is the most reliable operation for identifying salient sentences. If narrative twins are not available, random dropout can be used to generate the twins from a single story. Effective distractors can be obtained either by prompting LLMs or, in long-form narratives, by using different parts of the same story.
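The training signal described above has a natural triplet form: pull the anchor story toward its narrative twin, push it away from the surface-similar distractor. A minimal sketch with cosine similarity and a margin (both hyperparameters here are illustrative, not the paper's):

```python
import numpy as np

def twin_triplet_loss(anchor, twin, distractor, margin=0.2):
    """Triplet objective for narrative twins: the anchor story embedding should
    sit closer to its twin (same plot, different surface form) than to a
    distractor (similar surface, different plot), by at least `margin`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(0.0, margin - cos(anchor, twin) + cos(anchor, distractor))
```

When the embedding space already separates plot from surface form, the loss is zero; swapping twin and distractor makes it positive, which is what drives training.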
https://arxiv.org/abs/2601.07765
We propose the UAIT (Uncommon-sense Action Image-Text) dataset, a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes. Unlike previous datasets that focus on common visual scenes with statistical frequency advantages, UAIT challenges models with grammatically reasonable but semantically counter-commonsense image-text pairs. Such tasks require models to go beyond superficial pattern recognition and demonstrate a deep understanding of agent-patient relationships and physical feasibility. To build UAIT, we designed a semi-automated process to synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. Each sample is accompanied by a carefully designed multiple-choice question to test the model's competence in fine-grained reasoning. We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning. Experiments show that all models perform significantly worse than humans in semantic judgment, especially in distinguishing grammatical correctness from semantic rationality. Further experiments show that even a lightweight model can improve its accuracy after fine-tuning, demonstrating the great potential of directional adaptation. This study not only reveals the key weaknesses of VLMs, but also provides diagnostic tools and research directions for the development of robust models with real visual semantic reasoning capabilities.
https://arxiv.org/abs/2601.07737
In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. However, most contrastive paradigms are inherently discriminative and often struggle to capture the variability and uncertainty intrinsic to human motion. To address this issue, we propose a variational contrastive learning framework that integrates probabilistic latent modeling with contrastive self-supervised learning. This formulation enables the learning of structured and semantically meaningful representations that generalize across different datasets and supervision levels. Extensive experiments on three widely used skeleton-based action recognition benchmarks show that our proposed method consistently outperforms existing approaches, particularly in low-label regimes. Moreover, qualitative analyses show that, compared to other methods, the features our method provides are more relevant to the motion and sample characteristics, with greater focus on important skeleton joints.
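The probabilistic half of such a variational-contrastive objective typically amounts to a reparameterized Gaussian latent regularized toward a standard normal, with the contrastive loss computed on the sampled latents. The two helpers below sketch that machinery under those assumptions; the paper's exact loss composition is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, logvar):
    """KL(q || N(0, I)) for a diagonal Gaussian latent: the regularizer that
    keeps the per-sample latent distribution close to the prior."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, so the sample stays a differentiable
    function of the encoder outputs (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```

A full training step would add an InfoNCE term over the sampled `z` of two augmented views, weighting the KL term to trade off uncertainty modeling against discriminability.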
https://arxiv.org/abs/2601.07666
Psychological research has long utilized circumplex models to structure emotions, placing similar emotions adjacently and opposing ones diagonally. Although frequently used to interpret deep learning representations, these models are rarely directly incorporated into the representation learning of language models, leaving their geometric validity unexplored. This paper proposes a method to induce circular emotion representations within language model embeddings via contrastive learning on a hypersphere. We show that while this circular alignment offers superior interpretability and robustness against dimensionality reduction, it underperforms compared to conventional designs in high-dimensional settings and fine-grained classification. Our findings elucidate the trade-offs involved in applying psychological circumplex models to deep learning architectures.
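The circumplex geometry can be made concrete by anchoring each emotion at an equally spaced angle on the unit circle and pulling embeddings toward their anchors. The four-emotion layout below is a hypothetical example, not the paper's label set, and the cosine-distance pull is a simplification of contrastive learning on a hypersphere.

```python
import numpy as np

# hypothetical circumplex: similar emotions adjacent, opposites diagonal
EMOTIONS = ["joy", "anger", "sadness", "calm"]

def circumplex_targets(labels):
    """Place each emotion at an equally spaced angle on the unit circle."""
    idx = np.array([EMOTIONS.index(l) for l in labels])
    theta = 2 * np.pi * idx / len(EMOTIONS)
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

def alignment_loss(emb2d, labels):
    """Pull normalised 2-D embeddings toward their circumplex anchors
    (mean cosine distance): the circular-alignment signal in miniature."""
    t = circumplex_targets(labels)
    e = emb2d / np.linalg.norm(emb2d, axis=1, keepdims=True)
    return (1.0 - np.sum(e * t, axis=1)).mean()
```

By construction, opposing emotions receive antipodal anchors (cosine similarity -1), which is exactly the "opposites diagonally" property of the circumplex.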
https://arxiv.org/abs/2601.06575
Accurate drug-target interaction (DTI) prediction is essential for computational drug discovery, yet existing models often rely on single-modality predefined molecular descriptors or sequence-based embeddings with limited representativeness. We propose Tensor-DTI, a contrastive learning framework that integrates multimodal embeddings from molecular graphs, protein language models, and binding-site predictions to improve interaction modeling. Tensor-DTI employs a siamese dual-encoder architecture, enabling it to capture both chemical and structural interaction features while distinguishing interacting from non-interacting pairs. Evaluations on multiple DTI benchmarks demonstrate that Tensor-DTI outperforms existing sequence-based and graph-based models. We also conduct large-scale inference experiments on CDK2 across billion-scale chemical libraries, where Tensor-DTI produces chemically plausible hit distributions even when CDK2 is withheld from training. In enrichment studies against Glide docking and Boltz-2 co-folder, Tensor-DTI remains competitive on CDK2 and reduces the screening budget required to recover moderate fractions of high-affinity ligands on out-of-family targets under strict family-holdout splits. Additionally, we explore its applicability to protein-RNA and peptide-protein interactions. Our findings highlight the benefits of integrating multimodal information with contrastive objectives to enhance interaction-prediction accuracy and to provide more interpretable and reliability-aware models for virtual screening.
https://arxiv.org/abs/2601.05792
Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.
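One plausible reading of the contrastive GRPO loss on intermediate latents is an InfoNCE-style contrast against the rollout group sharing the same initial noise, with above-median-reward rollouts treated as positives. The median split and temperature below are assumptions of this sketch, not TAGRPO's published formulation.

```python
import numpy as np

def trajectory_contrast_loss(latent, rollouts, rewards, t=0.5):
    """Contrast an intermediate latent against a group of rollout latents:
    align it with high-reward trajectories (positives) while pushing away
    from low-reward ones via the softmax denominator."""
    sims = rollouts @ latent / (
        np.linalg.norm(rollouts, axis=1) * np.linalg.norm(latent))
    pos = sims[rewards >= np.median(rewards)]      # above-median = positive
    logits = sims / t
    logZ = np.log(np.exp(logits).sum())            # partition over the group
    return -(pos / t - logZ).mean()
```

A latent that already resembles the high-reward trajectories incurs a small loss, so gradient steps move generation toward the rewarded modes.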
https://arxiv.org/abs/2601.05729
Recent advancements in single-cell multi-omics, particularly RNA-seq, have provided profound insights into cellular heterogeneity and gene regulation. While single-cell foundation models based on the pre-trained language model (PLM) paradigm have shown promise, they remain constrained by insufficient integration of in-depth individual profiles and by neglect of the influence of noise within multi-modal data. To address both issues, we propose an Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL). It is built on a cross-modal Cell-Language pre-training framework, which comprises two key innovations: (1) a Large Language Model (LLM)-based workflow with retrieval-augmented generation (RAG) that enriches cell textual descriptions using open-world knowledge; (2) a Cross-modal Robust Alignment (CRA) objective that incorporates sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to strengthen the model's resistance to noisy data. After pretraining on 32M cell-text pairs, OKR-CELL obtains cutting-edge results across 6 evaluation tasks. Beyond standard benchmarks such as cell clustering, cell-type annotation, batch-effect correction, and few-shot annotation, the model also demonstrates superior performance in broader multi-modal applications, including zero-shot cell-type annotation and bidirectional cell-text retrieval.
https://arxiv.org/abs/2601.05648
Malicious image manipulation threatens public safety and requires efficient localization methods. Existing approaches depend on costly pixel-level annotations, which make training expensive. Existing weakly supervised methods rely only on image-level binary labels and focus on global classification, often overlooking the local edge cues that are critical for precise localization. We observe that feature variations at manipulated boundaries are substantially larger than in interior regions. To address this gap, we propose Semantic-Agnostic Prompt Learning (SAPL) in CLIP, which learns text prompts that intentionally encode non-semantic, boundary-centric cues so that CLIP's multimodal similarity highlights manipulation edges rather than high-level object semantics. SAPL combines two complementary modules, Edge-aware Contextual Prompt Learning (ECPL) and Hierarchical Edge Contrastive Learning (HECL), to exploit edge information in both the textual and visual spaces. ECPL leverages edge-enhanced image features to generate learnable textual prompts via an attention mechanism, embedding semantics-irrelevant information into the text features to guide CLIP toward manipulation edges. HECL extracts genuine and manipulated edge patches and utilizes contrastive learning to sharpen the discrimination between them. Finally, we predict the manipulated regions from the processed similarity map. Extensive experiments on multiple public benchmarks demonstrate that SAPL significantly outperforms existing approaches, achieving state-of-the-art localization performance.
https://arxiv.org/abs/2601.06222
Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, embedding methods based on generative large language models (LLMs) have emerged. These methods either rely on fixed prompt templates or modify the model architecture. The former lacks further optimization of the model and yields limited performance, while the latter alters the model's internal computational mechanisms, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving its inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various LLM benchmarks show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.
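The DPO objective at the heart of this setup is standard and compact: the policy should raise the log-likelihood of the preferred completion relative to a frozen reference more than it raises the dispreferred one. Below is the per-pair loss; the mapping of "winner" to a semantically equivalent paraphrase is the paper's setup, while the beta value is illustrative.

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    -log sigmoid(beta * ((logp_w - logp_w_ref) - (logp_l - logp_l_ref))).
    Inputs are sequence log-probabilities under the policy and the frozen
    reference model; 'w' is the preferred (semantically equivalent) sentence."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization (policy equals reference) the margin is zero and the loss is log 2; any relative gain for the winner drives it below that, which is the contrastive-learning connection the abstract formalizes under the Plackett-Luce model.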
https://arxiv.org/abs/2601.05075
Large Language Models (LLMs) have demonstrated remarkable capabilities on multiple-choice question answering benchmarks, but the complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies have made progress on identifying the neurons responsible for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring the coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that correlate positively with task completion, neglecting neurons with other roles, such as inhibitive ones, and they misattribute neurons due to fortuitous behaviors in LLMs (i.e., answering questions correctly by chance rather than through genuine understanding). To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification. The key insight is that task performance is jointly determined by neurons with two opposing roles: good neurons that facilitate task completion and bad neurons that inhibit it. NeuronLLM achieves a holistic modeling of neurons via contrastive learning over good and bad neurons, while leveraging augmented question sets to mitigate fortuitous behaviors in LLMs. Comprehensive experiments on LLMs of different sizes and families show the superiority of NeuronLLM over existing methods on four NLP tasks, providing new insights into LLM functional organization.
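The good-neuron/bad-neuron distinction can be illustrated with a toy correlational probe: score each neuron by how its activation covaries with task success and label strong positive correlates supportive, strong negative correlates inhibitive. This is a stand-in for intuition only; NeuronLLM identifies roles via contrastive learning, not raw correlation, and the threshold here is arbitrary.

```python
import numpy as np

def neuron_roles(acts, correct, thresh=0.5):
    """Label neurons 'good' (activation correlates positively with correct
    answers), 'bad' (negative correlation, i.e. inhibitive), or 'neutral'.
    acts: (n_questions, n_neurons) activations; correct: 0/1 outcomes."""
    a = (acts - acts.mean(axis=0)) / acts.std(axis=0)
    y = (correct - correct.mean()) / correct.std()
    corr = a.T @ y / len(y)                     # Pearson correlation per neuron
    return np.where(corr > thresh, "good",
                    np.where(corr < -thresh, "bad", "neutral"))
```

A neuron tracking success is tagged good, its complement bad, and a weakly related neuron neutral; augmented question sets would reduce the chance that a "good" tag reflects a lucky guess.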
https://arxiv.org/abs/2601.04548
Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI. Code available on
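A recurrence plot, the 2D representation used above, is straightforward to compute: it marks every pair of time steps whose values fall within a distance threshold. The threshold below is illustrative; the paper does not state its binarization settings here.

```python
import numpy as np

def recurrence_plot(series, eps=0.1):
    """Binary recurrence plot of a 1-D pixel time series (e.g. NDVI):
    R[i, j] = 1 when |x_i - x_j| < eps, turning temporal structure into a
    2-D image an encoder can consume."""
    x = np.asarray(series, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])   # pairwise |x_i - x_j|
    return (dist < eps).astype(int)
```

The diagonal is always 1 (a point recurs with itself), and off-diagonal structure encodes periodicity and regime changes in the vegetation index.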
https://arxiv.org/abs/2601.04127
Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from visual entanglement, capturing noise rather than manipulation skills. To address this, we propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. By employing contrastive learning, CLAP maps video transitions onto a quantized, physically executable codebook. Building on this representation, we introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation. Furthermore, we propose a Knowledge Matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning. Extensive experiments demonstrate that CLAP significantly outperforms strong baselines, enabling the effective transfer of skills from human videos to robotic execution. Project page: this https URL.
https://arxiv.org/abs/2601.04061
Cultural awareness in language models is the capacity to understand and adapt to diverse cultural contexts. However, most existing approaches treat culture as static background knowledge, overlooking its dynamic and evolving nature. This limitation reduces their reliability in downstream tasks that demand genuine cultural sensitivity. In this work, we introduce CALM, a novel framework designed to endow language models with cultural self-awareness. CALM disentangles task semantics from explicit cultural concepts and latent cultural signals, shaping them into structured cultural clusters through contrastive learning. These clusters are then aligned via cross-attention to establish fine-grained interactions among related cultural features and are adaptively integrated through a Mixture-of-Experts mechanism along culture-specific dimensions. The resulting unified representation is fused with the model's original knowledge to construct a culturally grounded internal identity state, which is further enhanced through self-prompted reflective learning, enabling continual adaptation and self-correction. Extensive experiments conducted on multiple cross-cultural benchmark datasets demonstrate that CALM consistently outperforms state-of-the-art methods.
https://arxiv.org/abs/2601.03483
Immunohistochemical (IHC) staining provides crucial molecular characterization of tissue samples and plays an indispensable role in the clinical examination and diagnosis of cancers. However, compared with the commonly used Hematoxylin and Eosin (H&E) staining, IHC staining involves complex procedures and is both time-consuming and expensive, which limits its widespread clinical use. Virtual staining converts H&E images to IHC images, offering a cost-effective alternative to clinical IHC staining. Nevertheless, using adjacent slides as ground truth often results in weakly-paired data with spatial misalignment and local deformations, hindering effective supervised learning. To address these challenges, we propose a novel topology-aware framework for H&E-to-IHC virtual staining. Specifically, we introduce a Topology-aware Consistency Matching (TACM) mechanism that employs graph contrastive learning and topological perturbations to learn robust matching patterns despite spatial misalignments, ensuring structural consistency. Furthermore, we propose a Topology-constrained Pathological Matching (TCPM) mechanism that aligns pathological positive regions based on node importance to enhance pathological consistency. Extensive experiments on two benchmarks across four staining tasks demonstrate that our method outperforms state-of-the-art approaches, achieving superior generation quality with higher clinical relevance.
https://arxiv.org/abs/2601.02806
Question answering (QA) systems achieve impressive performance on standard benchmarks like SQuAD, but remain vulnerable to adversarial examples. This project investigates the adversarial robustness of transformer models on the AddSent adversarial dataset through systematic experimentation across model scales and targeted mitigation strategies. We perform comprehensive multi-level error analysis using five complementary categorization schemes, identifying negation confusion and entity substitution as the primary failure modes. Through systematic evaluation of adversarial fine-tuning ratios, we identify 80% clean + 20% adversarial data as optimal. Data augmentation experiments reveal a capacity bottleneck in small models. Scaling from ELECTRA-small (14M parameters) to ELECTRA-base (110M parameters) eliminates the robustness-accuracy trade-off, achieving substantial improvements on both clean and adversarial data. We implement three targeted mitigation strategies, with Entity-Aware contrastive learning achieving best performance: 89.89% AddSent Exact Match (EM) and 90.73% SQuAD EM, representing 94.9% closure of the adversarial gap. To our knowledge, this is the first work integrating comprehensive linguistic error analysis with Named Entity Recognition (NER)-guided contrastive learning for adversarial QA, demonstrating that targeted mitigation can achieve near-parity between clean and adversarial performance.
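The 80% clean + 20% adversarial ratio found optimal above can be realized with a simple data-mixing routine. The specific sampling scheme (subsampling the clean set so the total size stays fixed) is an assumption of this sketch; the paper only reports the ratio.

```python
import random

def mix_training_data(clean, adversarial, adv_ratio=0.2, seed=0):
    """Build a training set at the given clean/adversarial ratio
    (default 80/20), keeping the total size equal to the clean set."""
    rng = random.Random(seed)
    n_total = len(clean)
    n_adv = int(round(adv_ratio * n_total))
    mixed = rng.sample(clean, n_total - n_adv) + rng.sample(adversarial, n_adv)
    rng.shuffle(mixed)   # interleave so batches see both distributions
    return mixed
```

In practice the examples would be (question, context, answer) records; integers stand in for them here.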
https://arxiv.org/abs/2601.02700
Device-free crowd-counting using WiFi Channel State Information (CSI) is a key enabling technology for a new generation of privacy-preserving Internet of Things (IoT) applications. However, practical deployment is severely hampered by the domain shift problem, where models trained in one environment fail to generalise to another. To overcome this, we propose a novel two-stage framework centred on a CSI-ResNet-A architecture. This model is pre-trained via self-supervised contrastive learning to learn domain-invariant representations and leverages lightweight Adapter modules for highly efficient fine-tuning. The resulting event sequence is then processed by a stateful counting machine to produce a final, stable occupancy estimate. We validate our framework extensively. On our WiFlow dataset, our unsupervised approach excels in a 10-shot learning scenario, achieving a final Mean Absolute Error (MAE) of just 0.44, a task on which supervised baselines fail. To formally quantify robustness, we introduce the Generalisation Index (GI), on which our model scores near-perfectly, confirming its ability to generalise. Furthermore, our framework sets a new state of the art on the public WiAR benchmark with 98.8% accuracy. Our ablation studies reveal the core strength of our design: adapter-based fine-tuning achieves performance within 1% of a full fine-tune (98.84% vs. 99.67%) while training 97.2% fewer parameters. Our work provides a practical and scalable solution for developing robust sensing systems ready for real-world IoT deployments.
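The parameter saving behind adapter-based fine-tuning comes from the standard bottleneck design: a small down-projection, a non-linearity, an up-projection, and a residual connection, with only the adapter weights trained. The sketch below is a generic illustration of that pattern, not the paper's CSI-ResNet-A; the dimensions and the zero-initialized up-projection are common conventions assumed here:

```python
import numpy as np

rng = np.random.default_rng(42)

class Adapter:
    """Bottleneck adapter: down-project, ReLU, up-project, residual."""
    def __init__(self, d_model, d_bottleneck, rng):
        self.w_down = rng.normal(scale=0.02, size=(d_model, d_bottleneck))
        self.w_up = np.zeros((d_bottleneck, d_model))  # zero init => identity at start
    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual connection
    def n_params(self):
        return self.w_down.size + self.w_up.size

d_model = 512
adapter = Adapter(d_model, 16, rng)
x = rng.normal(size=(4, d_model))
y = adapter(x)

# Compare against fully fine-tuning one 512x512 dense layer, just for scale.
full_finetune_params = d_model * d_model
saving = 1 - adapter.n_params() / full_finetune_params
print(y.shape, saving)
```

Because the up-projection starts at zero, the adapter is an exact identity before training, so inserting it cannot degrade the pre-trained model; even against a single dense layer, the trainable-parameter saving is already about 94%, and against a full network the fraction grows toward figures like the 97.2% reported above.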
https://arxiv.org/abs/2601.02203
Since floorplan data is readily available, long-term persistent, and robust to changes in visual appearance, visual Floorplan Localization (FLoc) has garnered significant attention. Existing methods either ingeniously match geometric priors or utilize sparse semantics to reduce FLoc uncertainty. However, they still suffer from ambiguous FLoc caused by repetitive structures within minimalist floorplans. Moreover, expensive but limited semantic annotations restrict their applicability. To address these issues, we propose DisCo-FLoc, which utilizes dual-level visual-geometric Contrasts to Disambiguate depth-aware visual FLoc, without requiring additional semantic labels. Our solution begins with a ray regression predictor tailored for ray-casting-based FLoc, predicting a series of FLoc candidates using depth estimation expertise. In addition, a novel contrastive learning method with position-level and orientation-level constraints is proposed to strictly match depth-aware visual features with the corresponding geometric structures in the floorplan. Such matches can effectively eliminate FLoc ambiguity and select the optimal imaging pose from FLoc candidates. Exhaustive comparative studies on two standard visual FLoc benchmarks demonstrate that our method outperforms the state-of-the-art semantic-based method, achieving significant improvements in both robustness and accuracy.
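The final step described above, selecting the optimal imaging pose from a set of FLoc candidates by matching depth-aware observations against floorplan geometry, can be sketched in its simplest form. This toy version scores candidates by raw L2 distance between observed and ray-cast depths; the paper's method instead matches learned contrastive features, and the poses and ray values below are entirely made up to show why repetitive rooms cause ambiguity:

```python
import numpy as np

def select_pose(observed_rays, candidates):
    """Pick the FLoc candidate whose floorplan ray-cast depths best match
    the depth-aware ray lengths predicted from the image (L2 score)."""
    scores = [float(np.linalg.norm(observed_rays - c["rays"])) for c in candidates]
    return candidates[int(np.argmin(scores))], scores

# Toy floorplan with a repetitive structure: two near-identical rooms.
candidates = [
    {"pose": (1.0, 2.0, 0.0), "rays": np.array([3.0, 2.5, 2.0, 2.5, 3.0])},
    {"pose": (7.0, 2.0, 0.0), "rays": np.array([3.0, 2.5, 2.1, 2.5, 3.0])},  # look-alike room
    {"pose": (4.0, 5.0, 1.6), "rays": np.array([1.0, 1.2, 5.0, 1.2, 1.0])},
]

observed = np.array([2.95, 2.5, 2.12, 2.48, 3.02])  # depth-estimated ray lengths
best, scores = select_pose(observed, candidates)
print(best["pose"])
```

The first two candidates differ by only 10 cm in one ray, so a noisy depth estimate can easily flip the decision; this is the ambiguity that the position-level and orientation-level contrastive constraints are designed to resolve.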
https://arxiv.org/abs/2601.01822