Graphs are ubiquitous, and learning on graphs has become a cornerstone in artificial intelligence and data mining communities. Unlike pixel grids in images or sequential structures in language, graphs exhibit a typical non-Euclidean structure with complex interactions among the objects. This paper argues that Riemannian geometry provides a principled and necessary foundation for graph representation learning, and that Riemannian graph learning should be viewed as a unifying paradigm rather than a collection of isolated techniques. While recent studies have explored the integration of graph learning and Riemannian geometry, most existing approaches are limited to a narrow class of manifolds, particularly hyperbolic spaces, and often adopt extrinsic manifold formulations. We contend that the central mission of Riemannian graph learning is to endow graph neural networks with intrinsic manifold structures, which remains underexplored. To advance this perspective, we identify key conceptual and methodological gaps in existing approaches and outline a structured research agenda along three dimensions: manifold type, neural architecture, and learning paradigm. We further discuss open challenges, theoretical foundations, and promising directions that are critical for unlocking the full potential of Riemannian graph learning. This paper aims to provide a coherent viewpoint and to stimulate broader exploration of Riemannian geometry as a foundational framework for future graph learning research.
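As a concrete instance of the intrinsic manifold operations this abstract advocates, the sketch below implements the exponential and logarithmic maps at the origin of the Poincaré ball with curvature -1, one of the hyperbolic models most Riemannian graph methods build on. This is the standard textbook construction, not code from the paper; the function names are illustrative.

```python
import numpy as np

def exp_map_origin(v, eps=1e-9):
    """Exponential map at the origin of the Poincare ball (curvature -1):
    maps a tangent vector v to a point strictly inside the unit ball."""
    n = np.linalg.norm(v)
    if n < eps:
        return v.copy()
    return np.tanh(n) * v / n

def log_map_origin(x, eps=1e-9):
    """Inverse map: sends a ball point back to the tangent space at the origin."""
    n = np.linalg.norm(x)
    if n < eps:
        return x.copy()
    return np.arctanh(n) * x / n

v = np.array([0.5, -1.2, 0.3])
x = exp_map_origin(v)
assert np.linalg.norm(x) < 1.0            # image stays inside the ball
assert np.allclose(log_map_origin(x), v)  # log inverts exp
```

Intrinsic graph layers typically compose such maps with tangent-space operations instead of treating embeddings as ordinary Euclidean vectors.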
https://arxiv.org/abs/2602.10982
Accurate retinal vessel segmentation is a critical prerequisite for quantitative analysis of retinal images and computer-aided diagnosis of vascular diseases such as diabetic retinopathy. However, the elongated morphology, wide scale variation, and low contrast of retinal vessels pose significant challenges for existing methods, making it difficult to simultaneously preserve fine capillaries and maintain global topological continuity. To address these challenges, we propose the Vessel-aware Frequency-domain and Global Spatial modeling Network (VFGS-Net), an end-to-end segmentation framework that seamlessly integrates frequency-aware feature enhancement, dual-path convolutional representation learning, and bidirectional asymmetric spatial state-space modeling within a unified architecture. Specifically, VFGS-Net employs a dual-path feature convolution module to jointly capture fine-grained local textures and multi-scale contextual semantics. A novel vessel-aware frequency-domain channel attention mechanism is introduced to adaptively reweight spectral components, thereby enhancing vessel-relevant responses in high-level features. Furthermore, at the network bottleneck, we propose a bidirectional asymmetric Mamba2-based spatial modeling block to efficiently capture long-range spatial dependencies and strengthen the global continuity of vascular structures. Extensive experiments on four publicly available retinal vessel datasets demonstrate that VFGS-Net achieves competitive or superior performance compared to state-of-the-art methods. Notably, our model consistently improves segmentation accuracy for fine vessels, complex branching patterns, and low-contrast regions, highlighting its robustness and clinical potential.
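The abstract does not give the exact form of the vessel-aware frequency-domain channel attention; below is a minimal numpy sketch of the general idea it describes, reweighting feature channels by a gated spectral-energy descriptor. The gate parameters `w` and `b` and the pooling choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def freq_channel_attention(feat, w, b):
    """Reweight channels of a (C, H, W) feature map using a
    frequency-domain descriptor: per-channel mean FFT magnitude,
    squashed through a sigmoid gate in (0, 1)."""
    C = feat.shape[0]
    spec = np.abs(np.fft.fft2(feat, axes=(-2, -1)))   # (C, H, W) spectra
    desc = spec.reshape(C, -1).mean(axis=1)           # (C,) spectral energy
    gate = 1.0 / (1.0 + np.exp(-(w * desc + b)))      # channel gates in (0, 1)
    return feat * gate[:, None, None]

feat = rng.standard_normal((8, 16, 16))
out = freq_channel_attention(feat, w=np.ones(8) * 0.01, b=np.zeros(8))
```

Because every gate lies in (0, 1), the module only attenuates channels; a learned `w`, `b` would let training emphasize vessel-relevant spectral bands.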
https://arxiv.org/abs/2602.10978
Deep neural networks for chest X-ray classification achieve strong average performance, yet often underperform for specific demographic subgroups, raising critical concerns about clinical safety and equity. Existing debiasing methods frequently yield inconsistent improvements across datasets or attain fairness by degrading overall diagnostic utility, treating fairness as a post hoc constraint rather than a property of the learned representation. In this work, we propose Stride-Net (Sensitive Attribute Resilient Learning via Disentanglement and Learnable Masking with Embedding Alignment), a fairness-aware framework that learns disease-discriminative yet demographically invariant representations for chest X-ray analysis. Stride-Net operates at the patch level, using a learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through adversarial confusion loss. To anchor representations in clinical semantics and discourage shortcut learning, we further enforce semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport. We evaluate Stride-Net on the MIMIC-CXR and CheXpert benchmarks across race and intersectional race-gender subgroups. Across architectures including ResNet and Vision Transformers, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving a more favorable accuracy-fairness trade-off than prior debiasing approaches. Our code is available at this https URL.
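The adversarial confusion loss mentioned above is commonly implemented as cross-entropy against the uniform distribution over sensitive-attribute classes, so the feature extractor is rewarded when an attribute classifier becomes maximally uncertain. A minimal numpy sketch of that standard formulation (an assumption; Stride-Net's exact loss may differ):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confusion_loss(attr_logits):
    """Cross-entropy between the predicted sensitive-attribute
    distribution and the uniform distribution; minimized (at log k)
    when the attribute classifier is completely uninformative."""
    p = softmax(attr_logits)
    k = p.shape[-1]
    return -np.mean(np.sum((1.0 / k) * np.log(p + 1e-12), axis=-1))

uncertain = confusion_loss(np.zeros((2, 3)))          # uniform predictions
confident = confusion_loss(np.array([[10., 0., 0.],
                                     [0., 10., 0.]]))  # leaks the attribute
assert confident > uncertain
```

Minimizing this term on the encoder, while the attribute head is trained adversarially to classify, pushes sensitive information out of the representation.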
https://arxiv.org/abs/2602.10875
Contrastive learning has demonstrated great success in representation learning, especially for image classification tasks. However, there is still a shortage of studies targeting regression tasks, and more specifically applications on hyperspectral data. In this paper, we propose a spectral-spatial contrastive learning framework for regression tasks on hyperspectral data, with a model-agnostic design that allows enhancing backbones such as 3D convolutional and transformer-based networks. Moreover, we provide a collection of transformations relevant for augmenting hyperspectral data. Experiments on synthetic and real datasets show that the proposed framework and transformations significantly improve the performance of all studied backbone models.
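The abstract does not enumerate its transformations, but augmentations for hyperspectral cubes typically act on the spectral axis. The sketch below shows three plausible examples of that kind (illustrative only, not the paper's actual list), operating on an (H, W, B) cube with B spectral bands:

```python
import numpy as np

def band_dropout(cube, rng, p=0.1):
    """Zero out a random subset of spectral bands of an (H, W, B) cube."""
    keep = (rng.random(cube.shape[-1]) >= p).astype(cube.dtype)
    return cube * keep

def spectral_jitter(cube, rng, sigma=0.01):
    """Add small per-band Gaussian noise to perturb the spectra."""
    noise = rng.normal(0.0, sigma, size=cube.shape[-1])
    return cube + noise

def spectral_shift(cube, shift=1):
    """Circularly shift the spectral axis by a few bands."""
    return np.roll(cube, shift, axis=-1)

rng = np.random.default_rng(0)
cube = np.arange(2 * 2 * 4, dtype=float).reshape(2, 2, 4)
```

Two such views of the same pixel or patch would form a positive pair in the contrastive objective.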
https://arxiv.org/abs/2602.10745
Super-resolution (SR) applied to real-world low-resolution (LR) images often results in complex, irregular degradations that stem from the inherent complexity of natural scene acquisition. In contrast to SR artifacts arising from synthetic LR images created under well-defined scenarios, these distortions are highly unpredictable and vary significantly across different real-life contexts. Consequently, assessing the quality of SR images (SR-IQA) obtained from realistic LR remains a challenging and underexplored problem. In this work, we introduce a no-reference SR-IQA approach tailored for such highly ill-posed realistic settings. The proposed method enables domain-adaptive IQA for real-world SR applications, particularly in data-scarce domains. We hypothesize that degradations in super-resolved images are strongly dependent on the underlying SR algorithms, rather than being solely determined by image content. To this end, we introduce a self-supervised learning (SSL) strategy that first pretrains multiple SR-model-oriented representations in a pretext stage. Our contrastive learning framework forms positive pairs from images produced by the same SR model and negative pairs from those generated by different methods, independent of image content. The proposed approach, S3 RIQA, further incorporates targeted preprocessing to extract complementary quality information and an auxiliary task to better handle the various degradation profiles associated with different SR scaling factors. To support unsupervised pretext training, we constructed a new dataset, SRMORSS; it includes a wide range of SR algorithms applied to numerous real LR images, which addresses a gap in existing datasets. Experiments on real SR-IQA benchmarks demonstrate that S3 RIQA consistently outperforms most state-of-the-art relevant metrics.
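The pretext objective above (positives from the same SR model, negatives from different ones, content ignored) can be sketched as a supervised-contrastive-style loss keyed on SR-model identity. This is a minimal illustration under that assumption, not the paper's implementation:

```python
import numpy as np

def model_contrastive_loss(emb, model_ids, tau=0.1):
    """Contrastive pretext loss where two images form a positive pair
    iff they were restored by the same SR model, regardless of content."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / tau
    n = len(emb)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        log_denom = np.log(np.exp(sim[i, others]).sum())
        for j in others:
            if model_ids[j] == model_ids[i]:
                # -log softmax probability of the positive j
                total += log_denom - sim[i, j]
                count += 1
    return total / max(count, 1)

# embeddings clustered by SR model score lower than mismatched clusters
emb = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
good = model_contrastive_loss(emb, np.array([0, 0, 1, 1]))
bad = model_contrastive_loss(emb, np.array([0, 1, 0, 1]))
assert good < bad
```

Minimizing this loss pulls together representations of images degraded by the same SR pipeline, which is exactly the degradation-centric signal the method hypothesizes matters for quality assessment.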
https://arxiv.org/abs/2602.10744
Asset retrieval--finding similar assets in a financial universe--is central to quantitative investment decision-making. Existing approaches define similarity through historical price patterns or sector classifications, but such backward-looking criteria provide no guarantee about future behavior. We argue that effective asset retrieval should be future-aligned: the retrieved assets should be those most likely to exhibit correlated future returns. To this end, we propose Future-Aligned Soft Contrastive Learning (FASCL), a representation learning framework whose soft contrastive loss uses pairwise future return correlations as continuous supervision targets. We further introduce an evaluation protocol designed to directly assess whether retrieved assets share similar future trajectories. Experiments on 4,229 US equities demonstrate that FASCL consistently outperforms 13 baselines across all future-behavior metrics. The source code will be available soon.
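The soft contrastive loss with continuous supervision targets can be sketched as a binary cross-entropy between sigmoid embedding similarity and future-return correlation mapped to [0, 1]. This is one plausible reading of the abstract, not FASCL's published loss; the temperature and mapping are assumptions:

```python
import numpy as np

def fascl_soft_loss(emb, future_returns, tau=0.25, eps=1e-7):
    """Soft contrastive loss: pairwise embedding similarity is regressed
    toward pairwise future-return correlation (mapped from [-1,1] to [0,1])."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = 1.0 / (1.0 + np.exp(-(emb @ emb.T) / tau))    # sigmoid similarity
    target = (np.corrcoef(future_returns) + 1.0) / 2.0  # soft labels in [0,1]
    iu = np.triu_indices(len(emb), k=1)                 # distinct pairs only
    s = np.clip(sim[iu], eps, 1 - eps)
    t = target[iu]
    return -np.mean(t * np.log(s) + (1 - t) * np.log(1 - s))

# assets 0,1 and 2,3 have perfectly correlated future returns
returns = np.array([[1., 2., 3., 4.], [2., 4., 6., 8.],
                    [-1., -2., -3., -4.], [-2., -4., -6., -8.]])
aligned = fascl_soft_loss(np.array([[1., 0.], [1., 0.], [-1., 0.], [-1., 0.]]), returns)
mixed = fascl_soft_loss(np.array([[1., 0.], [-1., 0.], [1., 0.], [-1., 0.]]), returns)
assert aligned < mixed
```

Unlike hard positives and negatives, the continuous target lets weakly correlated pairs sit at intermediate similarity instead of being forced apart.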
https://arxiv.org/abs/2602.10711
Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at this https URL.
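The linear scheduler that "gradually opens future attention" can be sketched with a soft mask that interpolates between strictly causal and fully bidirectional attention. The gradient-guided pre-warmup itself is not specified in the abstract and is omitted here; the mask form and function names are assumptions for illustration:

```python
import numpy as np

def soft_attention_mask(seq_len, alpha):
    """Attention mask whose future positions open gradually:
    alpha=0 is a strict causal mask, alpha=1 is fully bidirectional."""
    causal = np.tril(np.ones((seq_len, seq_len)))
    future = np.triu(np.ones((seq_len, seq_len)), k=1)
    return causal + alpha * future

def linear_open_schedule(step, warmup_steps, open_steps):
    """Stay causal during warmup, then open future attention linearly."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / open_steps)

m0 = soft_attention_mask(4, linear_open_schedule(0, 100, 1000))
m1 = soft_attention_mask(4, linear_open_schedule(1100, 100, 1000))
assert np.array_equal(m0, np.tril(np.ones((4, 4))))  # still causal
assert np.array_equal(m1, np.ones((4, 4)))           # fully bidirectional
```

The gradual transition avoids the training-dynamics shock of flipping a causally pretrained decoder straight into bidirectional attention.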
https://arxiv.org/abs/2602.10622
Real-world contact-rich manipulation demands robots to perceive temporal tactile feedback, capture subtle surface deformations, and reason about object properties as well as force dynamics. Although optical tactile sensors are uniquely capable of providing such rich information, existing tactile datasets and models remain limited. These resources primarily focus on object-level attributes (e.g., material) while largely overlooking fine-grained tactile temporal dynamics during physical interactions. We consider that advancing dynamic tactile perception requires a systematic hierarchy of dynamic perception capabilities to guide both data collection and model design. To address the lack of tactile data with rich dynamic information, we present ToucHD, a large-scale hierarchical tactile dataset spanning tactile atomic actions, real-world manipulations, and touch-force paired data. Beyond scale, ToucHD establishes a comprehensive tactile dynamic data ecosystem that explicitly supports hierarchical perception capabilities from the data perspective. Building on it, we propose AnyTouch 2, a general tactile representation learning framework for diverse optical tactile sensors that unifies object-level understanding with fine-grained, force-aware dynamic perception. The framework captures both pixel-level and action-specific deformations across frames, while explicitly modeling physical force dynamics, thereby learning multi-level dynamic perception capabilities from the model perspective. We evaluate our model on benchmarks that cover static object properties and dynamic physical attributes, as well as real-world manipulation tasks spanning multiple tiers of dynamic perception capabilities, from basic object-level understanding to force-aware dexterous manipulation. Experimental results demonstrate consistent and strong performance across sensors and tasks.
https://arxiv.org/abs/2602.09617
Digital histopathology whole slide images (WSIs) provide gigapixel-scale high-resolution images that are highly useful for disease diagnosis. However, digital histopathology image analysis faces significant challenges due to the limited training labels, since manually annotating specific regions or small patches cropped from large WSIs requires substantial time and effort. Weakly supervised multiple instance learning (MIL) offers a practical and efficient solution by requiring only bag-level (slide-level) labels, while each bag typically contains multiple instances (patches). Most MIL methods directly use frozen image patch features generated by various image encoders as inputs and primarily focus on feature aggregation. However, feature representation learning for encoder pretraining in MIL settings has largely been neglected. In our work, we propose a novel feature representation learning framework called weakly supervised contrastive learning (WeakSupCon) that incorporates bag-level label information during training. Our method does not rely on instance-level pseudo-labeling, yet it effectively separates patches with different labels in the feature space. Experimental results demonstrate that the image features generated by our WeakSupCon method lead to improved downstream MIL performance compared to self-supervised contrastive learning approaches on three datasets. Our related code is available at this http URL.
https://arxiv.org/abs/2602.09477
Magnetic resonance imaging (MRI) is essential for nasopharyngeal carcinoma (NPC) radiotherapy (RT), but practical constraints, such as patient discomfort, long scan times, and high costs, often lead to incomplete modalities in clinical practice, compromising RT planning accuracy. Traditional MRI synthesis methods are modality-specific, limited in anatomical adaptability, and lack clinical interpretability, failing to meet the RT needs of NPC. Here, we developed a unified foundation model integrating contrastive visual representation learning and vision-language alignment (VLA) to enable any-to-all MRI synthesis. The model uses a contrastive encoder for modality-invariant representations and a CLIP-based text-informed decoder for semantically consistent synthesis, supporting any-to-all MRI synthesis via one unified foundation model. Trained on 40,825 images from 13 institutions, it achieves consistently high performance (average SSIM 0.90, PSNR 27) across 26 internal/external validation sites (15,748 images), with superior synthesis fidelity and robustness to noise and domain shifts. Meanwhile, its unified representation enhances downstream RT-relevant tasks (e.g., segmentation). This work advances digital medicine solutions for NPC care by leveraging foundation models to bridge technical synthesis and clinical utility.
https://arxiv.org/abs/2602.08822
Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We theoretically show that the implicit bias of gradient descent induces a three-phase learning process (population-dominated, support-vector-dominated unlearning, and support-vector-dominated generalization) during which delayed generalization can arise. Our analysis further relates the emergence of grokking to asymmetries in the data, both in the number of examples per class and in the distribution of support vectors across classes, and yields a characterization of the grokking time. We experimentally validate our theory by planting different distributions of population points and support vectors, and by analyzing accuracy curves and hyperplane dynamics. Overall, our results demonstrate that grokking does not require depth or representation learning, and can emerge even in linear models through the dynamics of the bias term.
https://arxiv.org/abs/2602.08302
This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.
https://arxiv.org/abs/2602.08071
To cope with uncertain changes of the external world, intelligent systems must continually learn from complex, evolving environments and respond in real time. This ability, collectively known as general continual learning (GCL), encapsulates practical challenges such as online datastreams and blurry task boundaries. Although leveraging pretrained models (PTMs) has greatly advanced conventional continual learning (CL), these methods remain limited in reconciling the diverse and temporally mixed information along a single pass, resulting in sub-optimal GCL performance. Inspired by meta-plasticity and reconstructive memory in neuroscience, we introduce here an innovative approach named Meta Post-Refinement (MePo) for PTMs-based GCL. This approach constructs pseudo task sequences from pretraining data and develops a bi-level meta-learning paradigm to refine the pretrained backbone, which serves as a prolonged pretraining phase but greatly facilitates rapid adaptation of representation learning to downstream GCL tasks. MePo further initializes a meta covariance matrix as the reference geometry of pretrained representation space, enabling GCL to exploit second-order statistics for robust output alignment. MePo serves as a plug-in strategy that achieves significant performance gains across a variety of GCL benchmarks and pretrained checkpoints in a rehearsal-free manner (e.g., 15.10%, 13.36%, and 12.56% on CIFAR-100, ImageNet-R, and CUB-200 under Sup-21/1K). Our source code is available at this https URL.
https://arxiv.org/abs/2602.07940
Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
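The first use case, CLIP-based hazard screening, reduces to a cosine-similarity margin between an image embedding and two prompt banks. The sketch below uses toy vectors in place of real CLIP encoder outputs (the prompts, embeddings, and scoring margin are assumptions; the paper's exact screening rule may differ):

```python
import numpy as np

def hazard_score(img_emb, hazard_text_embs, benign_text_embs):
    """Low-latency semantic hazard signal: cosine-similarity margin
    between the best-matching hazard prompt and the best-matching
    benign prompt. Positive scores flag a potential hazard."""
    def unit(a):
        return a / np.linalg.norm(a, axis=-1, keepdims=True)
    img = unit(img_emb)
    return float((unit(hazard_text_embs) @ img).max()
                 - (unit(benign_text_embs) @ img).max())

# toy embeddings stand in for CLIP image/text encoder outputs
hazard = np.array([[1., 0., 0.], [0.9, 0.1, 0.]])   # e.g. "debris on the road"
benign = np.array([[0., 1., 0.], [0., 0.9, 0.1]])   # e.g. "clear road ahead"
risky_img = np.array([0.95, 0.05, 0.])
safe_img = np.array([0.05, 0.95, 0.])
assert hazard_score(risky_img, hazard, benign) > 0.0
assert hazard_score(safe_img, hazard, benign) < 0.0
```

Because the score needs no object detector or VQA pass, it can run as a cheap per-frame screen ahead of heavier perception modules, including for out-of-distribution hazards no detector class covers.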
https://arxiv.org/abs/2602.07680
Metastatic progression remains the leading cause of cancer-related mortality, yet predicting whether a primary tumor will metastasize and where it will disseminate directly from histopathology remains a fundamental challenge. Although whole-slide images (WSIs) provide rich morphological information, prior computational pathology approaches typically address metastatic status or site prediction as isolated tasks, and do not explicitly model the clinically sequential decision process of metastatic risk assessment followed by downstream site-specific evaluation. To address this research gap, we present a decision-aware, concept-aligned MIL framework, HistoMet, for prognostic metastatic outcome prediction from primary tumor WSIs. Our proposed framework adopts a two-module prediction pipeline in which the likelihood of metastatic progression from the primary tumor is first estimated, followed by conditional prediction of metastatic site for high-risk cases. To guide representation learning and improve clinical interpretability, our framework integrates linguistically defined and data-adaptive metastatic concepts through a pretrained pathology vision-language model. We evaluate HistoMet on a multi-institutional pan-cancer cohort of 6504 patients with metastasis follow-up and site annotations. Under clinically relevant high-sensitivity screening settings (95 percent sensitivity), HistoMet significantly reduces downstream workload while maintaining high metastatic risk recall. Conditional on metastatic cases, HistoMet achieves a macro F1 of 74.6 with a standard deviation of 1.3 and a macro one-vs-rest AUC of 92.1. These results demonstrate that explicitly modeling clinical decision structure enables robust and deployable prognostic prediction of metastatic progression and site tropism directly from primary tumor histopathology.
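The two-module decision structure, risk screening at a sensitivity-calibrated threshold followed by conditional site prediction, can be sketched as plain decision logic. The calibration rule below (largest threshold retaining the target recall on positives) is an assumption for illustration, not HistoMet's published procedure:

```python
import numpy as np

def sensitivity_threshold(scores, labels, sensitivity=0.95):
    """Largest risk threshold that keeps recall on metastatic (label 1)
    cases at or above the target sensitivity."""
    pos = np.sort(scores[labels == 1])
    k = int(np.floor((1.0 - sensitivity) * len(pos)))
    return pos[k]  # all but the k lowest-scoring positives pass

def two_stage_predict(risk_scores, site_probs, thr):
    """Stage 1 screens by metastatic risk; stage 2 predicts a metastatic
    site only for high-risk cases (None = screened out, no site needed)."""
    return [int(np.argmax(p)) if r >= thr else None
            for r, p in zip(risk_scores, site_probs)]

rng = np.random.default_rng(0)
labels = np.array([1] * 40 + [0] * 60)
scores = np.where(labels == 1,
                  rng.uniform(0.4, 1.0, 100),
                  rng.uniform(0.0, 0.6, 100))
thr = sensitivity_threshold(scores, labels)
recall = np.mean(scores[labels == 1] >= thr)
assert recall >= 0.95
```

Every case screened out at stage 1 is workload removed from site-specific evaluation, which is exactly the clinical saving the abstract reports.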
https://arxiv.org/abs/2602.07608
Precise identification of individual cows is a fundamental prerequisite for comprehensive digital management in smart livestock farming. While existing animal identification methods excel in controlled, single-camera settings, they face severe challenges regarding cross-camera generalization. When models trained on source cameras are deployed to new monitoring nodes characterized by divergent illumination, backgrounds, viewpoints, and heterogeneous imaging properties, recognition performance often degrades dramatically. This limits the large-scale application of non-contact technologies in dynamic, real-world farming environments. To address this challenge, this study proposes a cross-camera cow identification framework based on disentangled representation learning. This framework leverages the Subspace Identifiability Guarantee (SIG) theory in the context of bovine visual recognition. By modeling the underlying physical data generation process, we designed a principle-driven feature disentanglement module that decomposes observed images into multiple orthogonal latent subspaces. This mechanism effectively isolates stable, identity-related biometric features that remain invariant across cameras, thereby substantially improving generalization to unseen cameras. We constructed a high-quality dataset spanning five distinct camera nodes, covering heterogeneous acquisition devices and complex variations in lighting and angles. Extensive experiments across seven cross-camera tasks demonstrate that the proposed method achieves an average accuracy of 86.0%, significantly outperforming the Source-only Baseline (51.9%) and the strongest cross-camera baseline method (79.8%). This work establishes a subspace-theoretic feature disentanglement framework for collaborative cross-camera cow identification, offering a new paradigm for precise animal monitoring in uncontrolled smart farming environments.
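A minimal sketch of the orthogonal-subspace decomposition idea: an embedding is split into an identity component and a camera (nuisance) component lying in mutually orthogonal subspaces, so camera-specific variation can be discarded at matching time. The random bases below stand in for learned ones and are purely illustrative, not the paper's SIG-based module.

```python
import numpy as np

def orthogonal_bases(dim, k_id, k_cam, seed=0):
    """Orthonormal directions split into an identity subspace and a
    camera subspace. In the real framework these would be learned
    under an identifiability guarantee; here they are random."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, k_id + k_cam)))
    return q[:, :k_id], q[:, k_id:]

def disentangle(z, basis_id, basis_cam):
    """Project an embedding onto the two orthogonal latent subspaces."""
    z_id = basis_id @ (basis_id.T @ z)     # identity-related component
    z_cam = basis_cam @ (basis_cam.T @ z)  # camera-specific component
    return z_id, z_cam
```

Because the two bases share no directions, the identity component is untouched by any change confined to the camera subspace, which is the mechanism behind generalization to unseen cameras.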
https://arxiv.org/abs/2602.07566
Chirality is a fundamental molecular property that governs stereospecific behavior in chemistry and biology. Capturing chirality in machine learning models remains challenging due to the geometric complexity of stereochemical relationships and the limitations of traditional molecular representations that often lack explicit stereochemical encoding. Existing approaches to chiral molecular representation primarily focus on central chirality, relying on handcrafted stereochemical tags or limited 3D encodings, and thus fail to generalize to more complex forms such as axial chirality. In this work, we introduce ChiDeK (Chiral Determinant Kernels), a framework that systematically integrates stereogenic information into molecular representation learning. We propose the chiral determinant kernel to encode the SE(3)-invariant chirality matrix and employ cross-attention to integrate stereochemical information from local chiral centers into the global molecular representation. This design enables explicit modeling of chiral-related features within a unified architecture, capable of jointly encoding central and axial chirality. To support the evaluation of axial chirality, we construct a new benchmark for electronic circular dichroism (ECD) and optical rotation (OR) prediction. Across four tasks, including R/S configuration classification, enantiomer ranking, ECD spectrum prediction, and OR prediction, ChiDeK achieves substantial improvements over state-of-the-art baselines, most notably yielding over 7% higher accuracy on axially chiral tasks on average.
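The abstract does not spell out the chirality matrix, but the core invariance it encodes can be illustrated with a signed determinant of displacement vectors around a stereocenter: the sign is unchanged by rotations and translations (SE(3)) yet flips under mirror reflection, which is exactly what separates enantiomers. A hedged sketch, not ChiDeK's actual kernel:

```python
import numpy as np

def chirality_sign(center, neighbors):
    """Sign of det([n1-c, n2-c, n3-c]) for three substituents around a
    stereocenter c. Rotations (det R = +1) and translations leave it
    unchanged; a reflection (det = -1) flips it."""
    v = np.asarray(neighbors, dtype=float) - np.asarray(center, dtype=float)
    return np.sign(np.linalg.det(v))
```

A determinant-based feature of this kind gives a learnable, coordinate-free handle on handedness, which stereochemical tags in SMILES-style representations only encode symbolically.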
https://arxiv.org/abs/2602.07415
Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero-shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. The double robustness and the propensity score matching for the inclusion of data domains make matching more robust than naive pooling and uniform subsampling by filtering out the confounding domains (the main cause of heterogeneity). Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta-distributions, which are also extended to non-Gaussian and multimodal real-world settings. Most importantly, we show that these improvements translate to zero-shot medical anomaly detection, one of the extreme forms of data heterogeneity and asymmetry. The code is available on this https URL.
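As a rough illustration of selecting samples relative to an adaptive centroid (not the paper's full doubly robust, propensity-score procedure), the following sketch iteratively recomputes a centroid over the retained pool and drops the farthest samples, which tends to filter out an outlying confounding domain that naive pooling would keep:

```python
import numpy as np

def centroid_matching(features, keep_frac=0.7, n_iter=5):
    """Iteratively retain the samples closest to an adaptive centroid.
    A crude stand-in for matching: each pass shrinks the pool toward
    its dominant mode, shedding a distant confounding domain."""
    idx = np.arange(len(features))
    for _ in range(n_iter):
        centroid = features[idx].mean(axis=0)
        dist = np.linalg.norm(features[idx] - centroid, axis=1)
        k = max(1, int(keep_frac * len(idx)))
        idx = idx[np.argsort(dist)[:k]]
    return idx
```

Uniform subsampling, by contrast, would keep the outlying domain in proportion to its pool share, preserving the asymmetry rather than removing it.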
https://arxiv.org/abs/2602.07154
Various representation learning methods for molecular structures have been devised to accelerate data-driven chemistry. However, the representation capabilities of existing methods are essentially limited to atom-level information, which is not sufficient to describe real-world molecular physics. Although electron-level information can provide fundamental knowledge about chemical compounds beyond the atom-level information, obtaining the electron-level information in real-world molecules is computationally impractical and sometimes infeasible. We propose a method for learning electron-informed molecular representations without additional computation costs by transferring readily accessible electron-level information about small molecules to large molecules of our interest. The proposed method achieved state-of-the-art prediction accuracy on extensive benchmark datasets containing experimentally observed molecular physics. The source code for HEDMoL is available at this https URL.
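One hedged reading of "transferring readily accessible electron-level information about small molecules to large molecules" is a lookup-and-aggregate scheme over descriptors precomputed once for small fragments, so no new electronic-structure calculation is needed for the large molecule. The fragment table, names, and values below are hypothetical, for illustration only; they are not HEDMoL's actual transfer mechanism.

```python
import numpy as np

# Hypothetical electron-level descriptors, precomputed (e.g. by an
# ab initio calculation) for small fragments. Values are illustrative.
FRAGMENT_ELECTRON_DESC = {
    "CH3":  np.array([0.12, -0.34]),
    "OH":   np.array([-0.51, 0.08]),
    "C6H5": np.array([0.27, 0.61]),
}

def electron_informed_embedding(fragments):
    """Aggregate precomputed small-fragment electron descriptors into a
    representation for a larger molecule, at zero extra quantum cost."""
    return np.mean([FRAGMENT_ELECTRON_DESC[f] for f in fragments], axis=0)
```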
https://arxiv.org/abs/2602.07087
Insect vision supports complex behaviors including associative learning, navigation, and object detection, and has long motivated computational models for understanding biological visual processing. However, many contemporary models prioritize task performance while neglecting biologically grounded processing pathways. Here, we introduce a bio-inspired vision model that captures principles of the insect visual system to transform dense visual input into sparse, discriminative codes. The model is trained using a fully self-supervised contrastive objective, enabling representation learning without labeled data and supporting reuse across tasks without reliance on domain-specific classifiers. We evaluated the resulting representations on flower recognition tasks and natural image benchmarks. The model consistently produced reliable sparse codes that distinguish visually similar inputs. To support different modelling and deployment uses, we have implemented the model as both an artificial neural network and a spiking neural network. In a simulated localization setting, our approach outperformed a simple image downsampling comparison baseline, highlighting the functional benefit of incorporating neuromorphic visual processing pathways. Collectively, these results advance insect computational modelling by providing a generalizable bio-inspired vision model capable of sparse computation across diverse tasks.
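The dense-to-sparse transform at the heart of the model can be illustrated with a simple k-winners-take-all rule, a common sparse-coding device in insect-brain models (an assumption here; the abstract does not name the exact mechanism):

```python
import numpy as np

def k_winners_take_all(responses, k):
    """Keep only the k strongest responses and zero the rest, yielding a
    sparse, discriminative code from a dense activation vector."""
    sparse = np.zeros_like(responses)
    winners = np.argsort(responses)[-k:]
    sparse[winners] = responses[winners]
    return sparse
```

Codes produced this way stay comparable across tasks without a domain-specific classifier: two inputs can be matched simply by the overlap of their active units.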
https://arxiv.org/abs/2602.06405