Computer-aided diagnosis has recently demonstrated promising performance, effectively alleviating the workload of clinicians. However, the inherent sample imbalance among different diseases biases algorithms toward the majority categories, resulting in poor performance on rare categories. Existing works formulate this challenge as a long-tailed problem and attempt to tackle it by decoupling feature representation from classification. Yet, due to the imbalanced distribution and limited samples from tail classes, these works are prone to biased representation learning and insufficient classifier calibration. To tackle these problems, we propose a new Long-tailed Medical Diagnosis (LMD) framework for balanced medical image classification on long-tailed datasets. In the initial stage, we develop a Relation-aware Representation Learning (RRL) scheme that boosts representation ability by encouraging the encoder to capture intrinsic semantic features through different data augmentations. In the subsequent stage, we propose an Iterative Classifier Calibration (ICC) scheme that calibrates the classifier iteratively by generating a large number of balanced virtual features and fine-tuning the encoder in an Expectation-Maximization manner. The proposed ICC compensates for minority categories to facilitate unbiased classifier optimization while preserving the diagnostic knowledge of the majority classes. Comprehensive experiments on three public long-tailed medical datasets demonstrate that our LMD framework significantly surpasses state-of-the-art approaches. The source code can be accessed at this https URL.
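A minimal sketch of the virtual-feature idea behind ICC, under the assumption that balanced virtual features can be drawn from per-class Gaussian statistics of frozen-encoder features; the function names and training loop are illustrative, not the authors' implementation:

```python
# Sketch: calibrate a linear classifier on class-balanced "virtual" features
# sampled from per-class Gaussian statistics of frozen-encoder features.
# This illustrates the general idea only; it is not the authors' ICC code.
import torch
import torch.nn as nn

def class_stats(features, labels, num_classes):
    """Per-class mean and (diagonal) std of encoder features."""
    means, stds = [], []
    for c in range(num_classes):
        fc = features[labels == c]
        means.append(fc.mean(dim=0))
        stds.append(fc.std(dim=0) + 1e-6)
    return torch.stack(means), torch.stack(stds)

def sample_virtual(means, stds, per_class):
    """Draw an equal number of virtual features for every class."""
    num_classes, dim = means.shape
    feats = means[:, None, :] + stds[:, None, :] * torch.randn(num_classes, per_class, dim)
    labels = torch.arange(num_classes).repeat_interleave(per_class)
    return feats.reshape(-1, dim), labels

# toy usage: 3 imbalanced classes, 64-d features
feats = torch.randn(1000, 64)
labels = torch.randint(0, 3, (1000,))
means, stds = class_stats(feats, labels, num_classes=3)
vfeats, vlabels = sample_virtual(means, stds, per_class=256)

clf = nn.Linear(64, 3)
opt = torch.optim.SGD(clf.parameters(), lr=0.1)
for _ in range(50):                      # calibrate on the balanced virtual set
    loss = nn.functional.cross_entropy(clf(vfeats), vlabels)
    opt.zero_grad(); loss.backward(); opt.step()
```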
https://arxiv.org/abs/2502.03238
In this study, Disentanglement in Difference (DiD) is proposed to address the inherent inconsistency between the statistical independence of latent variables and the goal of semantic disentanglement in disentangled representation learning. Conventional disentanglement methods pursue disentangled representations by improving statistical independence among latent variables. However, statistical independence of latent variables does not necessarily imply that they are semantically unrelated; thus, improving statistical independence does not always enhance disentanglement performance. To address this issue, DiD directly learns semantic differences rather than the statistical independence of latent variables. In DiD, a Difference Encoder is designed to measure semantic differences, and a contrastive loss function is established to facilitate inter-dimensional comparison. Together, these components allow the model to directly differentiate and disentangle distinct semantic factors, thereby resolving the inconsistency between statistical independence and semantic disentanglement. Experimental results on the dSprites and 3DShapes datasets demonstrate that the proposed DiD outperforms existing mainstream methods across various disentanglement metrics.
https://arxiv.org/abs/2502.03123
Universal time series representation learning is challenging but valuable in real-world applications such as classification, anomaly detection, and forecasting. Recently, contrastive learning (CL) has been actively explored for time series representation. However, a key challenge is that the data augmentation process in CL can distort seasonal patterns or temporal dependencies, inevitably leading to a loss of semantic information. To address this challenge, we propose Topological Contrastive Learning for time series (TopoCL). TopoCL mitigates such information loss by incorporating persistent homology, which captures topological characteristics of the data that remain invariant under transformations. In this paper, we treat the temporal and topological properties of time series data as distinct modalities. Specifically, we compute persistent homology to construct topological features of time series data, representing them as persistence diagrams. We then design a neural network to encode these persistence diagrams. Our approach jointly optimizes CL within the time modality and time-topology correspondence, promoting a comprehensive understanding of both temporal semantics and topological properties of time series. We conduct extensive experiments on four downstream tasks: classification, anomaly detection, forecasting, and transfer learning. The results demonstrate that TopoCL achieves state-of-the-art performance.
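As a rough illustration of the topological branch, the sketch below builds a delay embedding of a univariate series and computes its persistence diagrams; it assumes the third-party ripser package and a Takens-style embedding, neither of which is confirmed by the abstract:

```python
# Sketch: turn a univariate time series into a point cloud via delay embedding
# and compute its persistence diagrams with ripser (H0 and H1).
# Illustrative only; TopoCL's exact pipeline may differ.
import numpy as np
from ripser import ripser  # pip install ripser

def delay_embed(x, dim=3, tau=2):
    """Takens-style sliding-window embedding of a 1-D signal."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i : i + n] for i in range(0, dim * tau, tau)], axis=1)

t = np.linspace(0, 8 * np.pi, 400)
signal = np.sin(t) + 0.1 * np.random.randn(400)   # noisy periodic series

cloud = delay_embed(signal, dim=3, tau=5)
dgms = ripser(cloud, maxdim=1)["dgms"]            # [H0 diagram, H1 diagram]
print("H1 features (birth, death):", dgms[1][:5])
```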
https://arxiv.org/abs/2502.02924
Recently, learning effective representations of urban regions has gained significant attention as a key approach to understanding urban dynamics and advancing smarter cities. Existing approaches have demonstrated the potential of leveraging mobility data to generate latent representations, providing valuable insights into the intrinsic characteristics of urban areas. However, incorporating the temporal dynamics and detailed semantics inherent in human mobility patterns remains underexplored. To address this gap, we propose a novel urban region representation learning model, Mobility Time Series Contrastive Learning for Urban Region Representations (MobiCLR), designed to capture semantically meaningful embeddings from inflow and outflow mobility patterns. MobiCLR uses contrastive learning to enhance the discriminative power of its representations, applying an instance-wise contrastive loss to capture distinct flow-specific characteristics. Additionally, we develop a regularizer to align output features with these flow-specific representations, enabling a more comprehensive understanding of mobility dynamics. To validate our model, we conduct extensive experiments in Chicago, New York, and Washington, D.C. to predict income, educational attainment, and social vulnerability. The results demonstrate that our model outperforms state-of-the-art models.
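A hedged sketch of an instance-wise contrastive loss of the kind described, pairing inflow and outflow embeddings of the same region as positives; the temperature and the symmetric cross-entropy form are assumptions, not MobiCLR's exact objective:

```python
# Sketch: an instance-wise NT-Xent-style contrastive loss between inflow and
# outflow embeddings of the same region (positives) vs. other regions (negatives).
# Illustrative of the loss family only, not MobiCLR's exact objective.
import torch
import torch.nn.functional as F

def instance_contrastive(z_in, z_out, temperature=0.1):
    z_in, z_out = F.normalize(z_in, dim=1), F.normalize(z_out, dim=1)
    logits = z_in @ z_out.t() / temperature          # (R, R) region-to-region similarities
    targets = torch.arange(z_in.size(0))             # region i matches region i
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

regions = 32
loss = instance_contrastive(torch.randn(regions, 128), torch.randn(regions, 128))
```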
https://arxiv.org/abs/2502.02912
Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments.
https://arxiv.org/abs/2502.02834
Effective self-supervised learning (SSL) techniques have been key to unlocking large datasets for representation learning. While many promising methods have been developed using online corpora and captioned photographs, their application to scientific domains, where data encodes highly specialized knowledge, remains in its early stages. We present a self-supervised masked modeling framework for 3D particle trajectory analysis in Time Projection Chambers (TPCs). These detectors produce globally sparse (<1% occupancy) but locally dense point clouds, capturing meter-scale particle trajectories at millimeter resolution. Starting with PointMAE, this work proposes volumetric tokenization to group sparse ionization points into resolution-agnostic patches, as well as an auxiliary energy infilling task to improve trajectory semantics. This approach -- which we call Point-based Liquid Argon Masked Autoencoder (PoLAr-MAE) -- achieves 99.4% track and 97.7% shower classification F-scores, matching those of supervised baselines without any labeled data. While the model learns rich particle trajectory representations, it struggles with sub-token phenomena like overlapping or short-lived particle trajectories. To support further research, we release PILArNet-M -- the largest open LArTPC dataset (1M+ events, 5.2B labeled points) -- to advance SSL in high energy physics (HEP). Project site: this https URL
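The volumetric tokenization step might look roughly like the following sketch, which bins sparse 3D points into voxels and keeps only non-empty voxels as tokens; the voxel size and point format (x, y, z, energy) are illustrative assumptions:

```python
# Sketch: group sparse 3-D ionization points into voxel "patches"
# (volumetric tokenization). Names and voxel size are illustrative assumptions.
import numpy as np

def voxelize(points, voxel_size=10.0):
    """Return a dict: voxel index (i, j, k) -> array of points inside it."""
    idx = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    tokens = {}
    for key, p in zip(map(tuple, idx), points):
        tokens.setdefault(key, []).append(p)
    return {k: np.stack(v) for k, v in tokens.items()}

# toy event: 2000 points in a 1 m^3 volume (coordinates in mm)
pts = np.random.rand(2000, 4) * [1000, 1000, 1000, 5.0]   # x, y, z, energy
tokens = voxelize(pts, voxel_size=50.0)
print(len(tokens), "non-empty voxel tokens")
```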
https://arxiv.org/abs/2502.02558
Ultrasound (US) imaging is clinically invaluable due to its noninvasive and safe nature. However, interpreting US images is challenging, requires significant expertise and time, and is often prone to errors. Deep learning offers assistive solutions such as segmentation. Supervised methods rely on large, high-quality, and consistently labeled datasets, which are challenging to curate. Moreover, these methods tend to underperform on out-of-distribution data, limiting their clinical utility. Self-supervised learning (SSL) has emerged as a promising alternative, leveraging unlabeled data to enhance model performance and generalisability. We introduce a contrastive SSL approach tailored for B-mode US images, incorporating a novel Relation Contrastive Loss (RCL). RCL encourages learning of distinct features by differentiating positive and negative sample pairs through a learnable metric. Additionally, we propose spatial and frequency-based augmentation strategies for representation learning on US images. Our approach significantly outperforms traditional supervised segmentation methods across three public breast US datasets, particularly in data-limited scenarios. Notable improvements on the Dice similarity metric include a 4% increase on 20% and 50% of the BUSI dataset, nearly 6% and 9% improvements on 20% and 50% of the BrEaST dataset, and 6.4% and 3.7% improvements on 20% and 50% of the UDIAT dataset, respectively. Furthermore, we demonstrate superior generalisability on the out-of-distribution UDIAT dataset with performance boosts of 20.6% and 13.6% compared to the supervised baseline using 20% and 50% of the BUSI and BrEaST training data, respectively. Our research highlights that domain-inspired SSL can improve US segmentation, especially under data-limited conditions.
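A speculative sketch of a contrastive loss with a learnable relation metric, in the spirit of the RCL described above; the relation network architecture and pairing scheme are assumptions rather than the paper's formulation:

```python
# Sketch: a contrastive loss where similarity between embeddings is scored by a
# small learnable relation network instead of fixed cosine similarity.
# This mirrors the idea of a learnable metric; it is not the paper's exact RCL.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationScore(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, a, b):
        # score every pair (i, j): concatenate and pass through the relation net
        n, m = a.size(0), b.size(0)
        pairs = torch.cat([a[:, None, :].expand(n, m, -1),
                           b[None, :, :].expand(n, m, -1)], dim=-1)
        return self.net(pairs).squeeze(-1)            # (n, m) relation logits

def relation_contrastive(z1, z2, scorer):
    logits = scorer(z1, z2)                           # learnable similarity
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

scorer = RelationScore(64)
loss = relation_contrastive(torch.randn(16, 64), torch.randn(16, 64), scorer)
```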
https://arxiv.org/abs/2502.02489
Recent advancements in foundation models have transformed computer vision, driving significant performance improvements across diverse domains, including digital histopathology. However, the advantages of domain-specific histopathology foundation models over general-purpose models for specialized tasks such as cell analysis remain underexplored. This study investigates the representation learning gap between these two categories by analyzing multi-level patch embeddings applied to cell instance segmentation and classification. We implement an encoder-decoder architecture with a consistent decoder and various encoders. These include convolutional, vision transformer (ViT), and hybrid encoders pre-trained on ImageNet-22K or LVD-142M, representing general-purpose foundation models. These are compared against ViT encoders from the recently released UNI, Virchow2, and Prov-GigaPath foundation models, trained on patches extracted from hundreds of thousands of histopathology whole-slide images. The decoder integrates patch embeddings from different encoder depths via skip connections to generate semantic and distance maps. These maps are then post-processed to create instance segmentation masks where each label corresponds to an individual cell and to perform cell-type classification. All encoders remain frozen during training to assess their pre-trained feature extraction capabilities. Using the PanNuke and CoNIC histopathology datasets, and the newly introduced Nissl-stained CytoDArk0 dataset for brain cytoarchitecture studies, we evaluate instance-level detection, segmentation accuracy, and cell-type classification. This study provides insights into the comparative strengths and limitations of general-purpose vs. histopathology foundation models, offering guidance for model selection in cell-focused histopathology and brain cytoarchitecture analysis workflows.
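A minimal sketch of the frozen-encoder probing setup described above: multi-level features are collected from a placeholder encoder whose parameters are frozen, ready to be fused by a trainable decoder via skip connections. The toy encoder stands in for the ImageNet/LVD and histopathology foundation models:

```python
# Sketch: freeze a pre-trained encoder and feed patch embeddings from several
# depths to a lightweight decoder via skip connections. The encoder here is a
# placeholder; the study uses general-purpose and histopathology foundation models.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
                                     for c_in, c_out in [(3, 32), (32, 64), (64, 128)]])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = torch.relu(stage(x))
            feats.append(x)                      # keep every depth for skips
        return feats

encoder = ToyEncoder()
for p in encoder.parameters():                   # frozen: probe pre-trained features only
    p.requires_grad = False
encoder.eval()

with torch.no_grad():
    multi_level = encoder(torch.randn(1, 3, 256, 256))
print([f.shape for f in multi_level])            # a trainable decoder would fuse these
```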
https://arxiv.org/abs/2502.02471
Graph Neural Networks (GNNs) have significant advantages in handling non-Euclidean data and have been widely applied across various areas, thus receiving increasing attention in recent years. The framework of GNN models mainly includes the information propagation phase and the aggregation phase, treating nodes and edges as information entities and propagation channels, respectively. However, most existing GNN models face the challenge of disconnection between node and edge feature information, as these models typically treat the learning of edge and node features as independent tasks. To address this limitation, we aim to develop an edge-empowered graph feature preference learning framework that can capture edge embeddings to assist node embeddings. By leveraging the learned multidimensional edge feature matrix, we construct multi-channel filters to more effectively capture accurate node features, thereby obtaining non-local structural characteristics and fine-grained high-order node features. Specifically, the inclusion of multidimensional edge information enhances the functionality and flexibility of the GNN model, enabling it to handle complex and diverse graph data more effectively. Additionally, integrating relational representation learning into the message passing framework allows graph nodes to receive more useful information, thereby facilitating node representation learning. Finally, experiments on four real-world heterogeneous graphs demonstrate the effectiveness of the proposed model.
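One plausible way to let edge embeddings assist node embeddings is to gate neighbor messages with learned edge features, as in this sketch; the dense-adjacency formulation and gating function are illustrative assumptions, not the paper's architecture:

```python
# Sketch: one message-passing layer where multidimensional edge features gate
# the messages sent between nodes (edge embeddings "assist" node embeddings).
# Dense adjacency is used for brevity; this is not the paper's implementation.
import torch
import torch.nn as nn

class EdgeConditionedLayer(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.edge_gate = nn.Linear(edge_dim, node_dim)   # edge features -> per-channel gate
        self.update = nn.Linear(2 * node_dim, node_dim)

    def forward(self, x, adj, edge_feat):
        # x: (N, d), adj: (N, N) 0/1, edge_feat: (N, N, e)
        gate = torch.sigmoid(self.edge_gate(edge_feat))          # (N, N, d)
        messages = gate * x[None, :, :] * adj[..., None]         # gated neighbor features
        agg = messages.sum(dim=1) / adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.update(torch.cat([x, agg], dim=-1)))

N, d, e = 8, 16, 4
layer = EdgeConditionedLayer(d, e)
out = layer(torch.randn(N, d), (torch.rand(N, N) > 0.5).float(), torch.randn(N, N, e))
```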
https://arxiv.org/abs/2502.02302
Ubiquitous geometric objects can be precisely and efficiently represented as polyhedra. The transformation of a polyhedron into a vector, known as polyhedra representation learning, is crucial for manipulating these shapes with mathematical and statistical tools for tasks like classification, clustering, and generation. Recent years have witnessed significant strides in this domain, yet most efforts focus on the vertex sequence of a polyhedron, neglecting the complex surface modeling crucial in real-world polyhedral objects. This study proposes PolyhedronNet, a general framework tailored for learning representations of 3D polyhedral objects. We propose the concept of the surface-attributed graph to seamlessly model the vertices, edges, faces, and their geometric interrelationships within a polyhedron. To effectively learn the representation of the entire surface-attributed graph, we first break it down into local rigid representations, learning each local region's relative position against the remaining regions without geometric information loss. Subsequently, we propose PolyhedronGNN to hierarchically aggregate the local rigid representations via intra-face and inter-face geometric message passing modules, obtaining a global representation that minimizes information loss while maintaining rotation and translation invariance. Our experimental evaluations on four distinct datasets, encompassing both classification and retrieval tasks, substantiate PolyhedronNet's efficacy in capturing comprehensive and informative representations of 3D polyhedral objects. Code and data are available at this https URL.
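A minimal container for a surface-attributed graph, holding the vertices, edges, and faces of a polyhedron, might look like the sketch below; field names and the tetrahedron example are illustrative, not PolyhedronNet's data format:

```python
# Sketch: a minimal surface-attributed graph container for a polyhedron, holding
# vertices, edges, and faces together with vertex coordinates. Field names are
# illustrative; PolyhedronNet's actual data format may differ.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SurfaceAttributedGraph:
    vertices: List[Tuple[float, float, float]]            # 3-D coordinates
    edges: List[Tuple[int, int]]                           # vertex index pairs
    faces: List[List[int]] = field(default_factory=list)   # vertex index loops

    def face_edges(self, f: int) -> List[Tuple[int, int]]:
        """Edges bounding face f, in loop order."""
        loop = self.faces[f]
        return [(loop[i], loop[(i + 1) % len(loop)]) for i in range(len(loop))]

# unit tetrahedron
tet = SurfaceAttributedGraph(
    vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)],
    edges=[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)],
    faces=[[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]],
)
print(tet.face_edges(0))   # [(0, 1), (1, 2), (2, 0)]
```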
https://arxiv.org/abs/2502.01814
With recent advancements in generative AI such as GANs, diffusion models, and VAEs, the use of generative AI for dance generation has seen significant progress and received considerable interest. In this study, we propose R-Lodge, an enhanced version of Lodge. R-Lodge incorporates a recurrent sequential representation learning scheme, named Dance Recalibration, into the original coarse-to-fine long dance generation model. R-Lodge applies Dance Recalibration through $N$ Dance Recalibration Blocks to address the lack of consistency in the coarse dance representation of the Lodge model. With this method, each generated dance motion incorporates information from the preceding dance motions. We evaluate R-Lodge on the FineDance dataset, and the results show that it enhances the consistency of the generated dance motions.
https://arxiv.org/abs/2502.01190
Molecular property prediction uses molecular structure to infer chemical properties. Chemically interpretable representations that capture meaningful intramolecular interactions enhance the usability and effectiveness of these predictions. However, existing methods often rely on atom-based or rule-based fragment tokenization, which can be chemically suboptimal and lack scalability. We introduce FragmentNet, a graph-to-sequence foundation model with an adaptive, learned tokenizer that decomposes molecular graphs into chemically valid fragments while preserving structural connectivity. FragmentNet integrates VQVAE-GCN for hierarchical fragment embeddings, spatial positional encodings for graph serialization, global molecular descriptors, and a transformer. Pre-trained with Masked Fragment Modeling and fine-tuned on MoleculeNet tasks, FragmentNet outperforms models with similarly scaled architectures and datasets while rivaling larger state-of-the-art models requiring significantly more resources. This novel framework enables adaptive decomposition, serialization, and reconstruction of molecular graphs, facilitating fragment-based editing and visualization of property trends in learned embeddings - a powerful tool for molecular design and optimization.
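The VQ-VAE-style quantization underlying a learned tokenizer can be sketched as follows; this shows only the generic nearest-codebook lookup with a straight-through estimator, not FragmentNet's adaptive fragment tokenizer:

```python
# Sketch: the vector-quantization step used in VQ-VAE-style tokenizers, mapping
# continuous fragment embeddings to the nearest codebook entries. This shows the
# generic mechanism only, not FragmentNet's learned tokenizer.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes, dim):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # nearest codebook entry for each embedding (straight-through estimator)
        d = torch.cdist(z, self.codebook.weight)        # (B, num_codes)
        codes = d.argmin(dim=1)
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()                    # pass gradients to the encoder
        return z_q, codes

vq = VectorQuantizer(num_codes=512, dim=64)
z_q, codes = vq(torch.randn(10, 64))                    # 10 fragment embeddings -> tokens
```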
https://arxiv.org/abs/2502.01184
Few-shot learning (FSL) has recently been extensively utilized to overcome the scarcity of training data in domain-specific visual recognition. In real-world scenarios, environmental factors such as complex backgrounds, varying lighting conditions, long-distance shooting, and moving targets often cause test images to exhibit numerous incomplete targets or noise disruptions. However, current research on evaluation datasets and methodologies has largely ignored the concept of "environmental robustness", which refers to maintaining consistent performance in complex and diverse physical environments. This neglect has led to a notable decline in the performance of FSL models during practical testing compared to their training performance. To bridge this gap, we introduce a new real-world multi-domain few-shot learning (RD-FSL) benchmark, which includes four domains and six evaluation datasets. The test images in this benchmark feature various challenging elements, such as camouflaged objects, small targets, and blurriness. Our evaluation experiments reveal that existing methods struggle to utilize training images effectively to generate accurate feature representations for challenging test images. To address this problem, we propose a novel conditional representation learning network (CRLNet) that integrates the interactions between training and testing images as conditional information in their respective representation processes. The main goal is to reduce intra-class variance or enhance inter-class variance at the feature representation level. Finally, comparative experiments reveal that CRLNet surpasses the current state-of-the-art methods, achieving performance improvements ranging from 6.83% to 16.98% across diverse settings and backbones. The source code and dataset are available at this https URL.
https://arxiv.org/abs/2502.01183
Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction, yet prevailing self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse. Existing approaches often depend on feature reconstruction, negative sampling, or complex decoders, which introduce training overhead and hinder generalization. Further, current techniques that address such limitations fail to account for the contribution of node embeddings to a given prediction in the absence of labeled nodes. To address these limitations, we propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information. Additionally, we introduce a semantic-aware objective term that incorporates pseudo-labels derived from Gaussian Mixture Models (GMMs), enhancing node discriminability by evaluating latent feature contributions. Extensive experiments demonstrate that our framework outperforms state-of-the-art graph SSL methods across benchmarks, achieving superior performance without contrastive loss or complex decoders. Key innovations include (1) a non-contrastive, view-invariant joint embedding predictive architecture, (2) leveraging a single-context, multiple-target relationship between subgraphs, and (3) GMM-based pseudo-label scoring to capture semantic contributions. This work advances graph SSL by offering a computationally efficient, collapse-resistant paradigm that bridges spatial and semantic graph features for downstream tasks. The code for our paper can be found at this https URL
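The GMM-based pseudo-label scoring can be sketched with scikit-learn as below; the number of components and the use of posterior responsibilities as soft labels are assumptions for illustration:

```python
# Sketch: fit a Gaussian Mixture Model on node embeddings and use its posterior
# responsibilities as soft pseudo-labels for a semantic-aware objective term.
# Cluster count and weighting are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

embeddings = np.random.randn(500, 32)                 # stand-in for learned node embeddings

gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(embeddings)
soft_pseudo = gmm.predict_proba(embeddings)           # (500, 8) responsibilities
hard_pseudo = soft_pseudo.argmax(axis=1)              # discrete pseudo-labels
confidence = soft_pseudo.max(axis=1)                  # can down-weight uncertain nodes
```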
https://arxiv.org/abs/2502.01684
Recent years have witnessed rapid advances in graph representation learning, with the continuous embedding approach emerging as the dominant paradigm. However, such methods encounter issues regarding parameter efficiency, interpretability, and robustness. Thus, Quantized Graph Representation (QGR) learning has recently gained increasing interest; it represents the graph structure with discrete codes instead of conventional continuous embeddings. Given its representation form analogous to natural language, QGR also possesses the capability to seamlessly integrate graph structures with large language models (LLMs). As this emerging paradigm is still in its infancy yet holds significant promise, we undertake this thorough survey to promote its future development. We first present the background of general quantization methods and their merits. We then provide an in-depth review of current QGR studies from the perspectives of quantization strategies, training objectives, distinctive designs, knowledge graph quantization, and applications. We further explore strategies for code dependence learning and integration with LLMs. Finally, we discuss open problems and conclude with future directions, aiming to provide a comprehensive picture of QGR and inspire future research.
https://arxiv.org/abs/2502.00681
Recent generalizable fault diagnosis research has effectively tackled the distributional shift between unseen working conditions. Most of it focuses on learning domain-invariant representations through feature-level methods. However, the increasing number of unseen domains may lead to domain-invariant features containing instance-level spurious correlations, which impair the generalization ability of previous models. To address these limitations, we propose the Fourier-based Augmentation Reconstruction Network (FARNet). The method is motivated by the observation that the Fourier phase component and amplitude component preserve different semantic information of the signals, which can be exploited in domain augmentation techniques. The network comprises an amplitude spectrum sub-network and a phase spectrum sub-network, sequentially reducing the discrepancy between the source and target domains. To construct a more robust generalized model, we employ a multi-source domain data augmentation strategy in the frequency domain. Specifically, a Frequency-Spatial Interaction Module (FSIM) is introduced to handle global information and local spatial features, promoting representation learning between the two sub-networks. To refine the decision boundary of our model output compared to conventional triplet loss, we propose a manifold triplet loss that contributes to generalization. Through extensive experiments on the CWRU and SJTU datasets, FARNet demonstrates effective performance and achieves superior results compared to current cross-domain approaches on the benchmarks.
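The observation that phase and amplitude carry different semantics suggests frequency-domain augmentations such as the following sketch, which keeps a source signal's phase while mixing in another domain's amplitude spectrum; the mixing rule is an illustrative assumption, not FARNet's augmentation:

```python
# Sketch: a frequency-domain augmentation that keeps a signal's phase spectrum
# but mixes its amplitude spectrum with a signal from another domain, reflecting
# the observation that phase and amplitude carry different semantics.
import numpy as np

def amplitude_mix(x_src, x_tgt, lam=0.5):
    """Recombine: source phase + interpolated amplitude."""
    f_src, f_tgt = np.fft.rfft(x_src), np.fft.rfft(x_tgt)
    amp = (1 - lam) * np.abs(f_src) + lam * np.abs(f_tgt)
    phase = np.angle(f_src)
    return np.fft.irfft(amp * np.exp(1j * phase), n=len(x_src))

t = np.linspace(0, 1, 2048)
src = np.sin(2 * np.pi * 30 * t)                                 # vibration signal, domain A
tgt = np.sin(2 * np.pi * 55 * t) + 0.2 * np.random.randn(2048)   # domain B
aug = amplitude_mix(src, tgt, lam=0.3)                           # augmented source-domain sample
```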
https://arxiv.org/abs/2502.00545
Spatially embedded networks (SENs) represent a special type of complex graph whose topology is constrained by the network's embedded spatial environment. The graph representation of such networks is thereby influenced by the embedded spatial features of both nodes and edges. Accurately representing the graph structure and graph features is fundamental to various graph-related tasks. In this study, a Generic Multimodal Spatially Graph Convolutional Network (GMu-SGCN) is developed for efficient representation of spatially embedded networks. The developed GMu-SGCN model can learn node connection patterns via multimodal node and edge features. To evaluate the developed model, a river network dataset and a power network dataset are used as test beds. The river network represents naturally developed SENs, whereas the power network represents a man-made network. Both types of networks are heavily constrained by their spatial environments and by uncertainties from nature. Comprehensive evaluation analysis shows that the developed GMu-SGCN improves the accuracy of the edge existence prediction task by 37.1% compared to a GraphSAGE model that only considers the node's position feature in the power network test bed. Our model demonstrates the importance of considering multidimensional spatial features for spatially embedded network representation.
https://arxiv.org/abs/2502.00530
Can integrating spectral and curvature signals unlock new potential in graph representation learning? Non-Euclidean geometries, particularly Riemannian manifolds such as hyperbolic (negative curvature) and spherical (positive curvature), offer powerful inductive biases for embedding complex graph structures like scale-free, hierarchical, and cyclic patterns. Meanwhile, spectral filtering excels at processing signal variations across graphs, making it effective in homophilic and heterophilic settings. Leveraging both can significantly enhance the learned representations. To this end, we propose Spectro-Riemannian Graph Neural Networks (CUSP) - the first graph representation learning paradigm that unifies both CUrvature (geometric) and SPectral insights. CUSP is a mixed-curvature spectral GNN that learns spectral filters to optimize node embeddings in products of constant-curvature manifolds (hyperbolic, spherical, and Euclidean). Specifically, CUSP introduces three novel components: (a) Cusp Laplacian, an extension of the traditional graph Laplacian based on Ollivier-Ricci curvature, designed to capture the curvature signals better; (b) Cusp Filtering, which employs multiple Riemannian graph filters to obtain cues from various bands in the eigenspectrum; and (c) Cusp Pooling, a hierarchical attention mechanism combined with a curvature-based positional encoding to assess the relative importance of differently curved substructures in our graph. Empirical evaluation across eight homophilic and heterophilic datasets demonstrates the superiority of CUSP in node classification and link prediction tasks, with a gain of up to 5.3% over state-of-the-art models.
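A rough sketch of how curvature signals could enter a spectral pipeline: weight graph edges by (precomputed) Ollivier-Ricci curvatures and form a weighted Laplacian whose spectrum feeds the filters. The exponential weighting and the randomly filled curvature values are placeholders, not the paper's Cusp Laplacian:

```python
# Sketch: build a Laplacian whose edge weights are modulated by (precomputed)
# Ollivier-Ricci curvatures, so that curvature signals enter spectral filtering.
# The weighting function is an illustrative choice, not the paper's Cusp Laplacian.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
# assume edge curvatures in [-1, 1] have already been computed and stored
rng = np.random.default_rng(0)
for u, v in G.edges():
    G[u][v]["ricci"] = rng.uniform(-1, 1)

n = G.number_of_nodes()
W = np.zeros((n, n))
for u, v, data in G.edges(data=True):
    w = np.exp(-data["ricci"])          # negatively curved edges get larger weight
    W[u, v] = W[v, u] = w

D = np.diag(W.sum(axis=1))
L = D - W                               # curvature-weighted combinatorial Laplacian
eigvals = np.linalg.eigvalsh(L)         # spectrum available to graph filters
```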
https://arxiv.org/abs/2502.00401
Causal Bayesian networks are 'causal' models since they make predictions about interventional distributions. To connect such causal model predictions to real-world outcomes, we must determine which actions in the world correspond to which interventions in the model. For example, to interpret an action as an intervention on a treatment variable, the action will presumably have to a) change the distribution of treatment in a way that corresponds to the intervention, and b) not change other aspects, such as how the outcome depends on the treatment; while the marginal distributions of some variables may change as an effect. We introduce a formal framework to make such requirements for different interpretations of actions as interventions precise. We prove that the seemingly natural interpretation of actions as interventions is circular: Under this interpretation, every causal Bayesian network that correctly models the observational distribution is trivially also interventionally valid, and no action yields empirical data that could possibly falsify such a model. We prove an impossibility result: No interpretation exists that is non-circular and simultaneously satisfies a set of natural desiderata. Instead, we examine non-circular interpretations that may violate some desiderata and show how this may in turn enable the falsification of causal models. By rigorously examining how a causal Bayesian network could be a 'causal' model of the world instead of merely a mathematical object, our formal framework contributes to the conceptual foundations of causal representation learning, causal discovery, and causal abstraction, while also highlighting some limitations of existing approaches.
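A worked toy example of the requirement discussed above, where do(X = 1) replaces only X's mechanism while P(Z) and P(Y | X, Z) stay fixed; all probabilities are made-up illustrative values:

```python
# Sketch: a three-variable causal Bayesian network (Z -> X -> Y, Z -> Y) where an
# intervention do(X = 1) replaces X's mechanism but leaves P(Z) and P(Y | X, Z)
# untouched, which is exactly the requirement discussed above. Numbers are toy values.
P_Z = {0: 0.6, 1: 0.4}
P_X_given_Z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}             # P(X=x | Z=z)
P_Y_given_XZ = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}  # P(Y=1 | X=x, Z=z)

# observational P(Y=1): sum over Z and X of P(Z) P(X|Z) P(Y|X,Z)
obs = sum(P_Z[z] * P_X_given_Z[z][x] * P_Y_given_XZ[(x, z)]
          for z in (0, 1) for x in (0, 1))

# interventional P(Y=1 | do(X=1)): X is forced to 1, P(Z) and P(Y|X,Z) unchanged
do_x1 = sum(P_Z[z] * P_Y_given_XZ[(1, z)] for z in (0, 1))

print(f"P(Y=1) = {obs:.3f},  P(Y=1 | do(X=1)) = {do_x1:.3f}")   # 0.408 vs 0.660
```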
https://arxiv.org/abs/2501.19335
The effectiveness of credit assignment in reinforcement learning (RL) when dealing with high-dimensional data is influenced by the success of representation learning via deep neural networks, and has implications for the sample efficiency of deep RL algorithms. Input decorrelation has previously been introduced as a method to speed up optimization in neural networks, and has proven impactful both for efficient deep learning and as a method for effective representation learning in deep RL algorithms. We propose a novel approach to online decorrelation in deep RL based on the decorrelated backpropagation algorithm that seamlessly integrates the decorrelation process into the RL training pipeline. Decorrelation matrices are added to each layer and updated using a separate decorrelation learning rule that minimizes the total decorrelation loss across all layers, in parallel with minimizing the usual RL loss. We used our approach in combination with the soft actor-critic (SAC) method, which we refer to as decorrelated soft actor-critic (DSAC). Experiments with DSAC on the Atari 100k benchmark show, compared to the regular SAC baseline, faster training in five of the seven games tested and improved reward performance in two games, with around a 50% reduction in wall-clock time, while maintaining performance levels on the other games. These results demonstrate the positive impact of network-wide decorrelation in deep RL for improving sample efficiency through more effective credit assignment.
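A hedged sketch of a layer-wise decorrelation update of the kind described: a matrix R is applied to layer inputs and nudged so that the off-diagonal covariance of the decorrelated activations shrinks. The exact learning rule and hyperparameters are assumptions, not the DSAC implementation:

```python
# Sketch: a layer-wise decorrelation matrix R applied to inputs, updated with a
# simple rule that shrinks off-diagonal covariance of the decorrelated activations.
# This illustrates the general mechanism; the exact DSAC update may differ.
import torch

def decorrelation_step(R, x, lr=1e-3):
    """x: (batch, d) layer inputs; R: (d, d) decorrelation matrix."""
    z = x @ R.t()                                     # decorrelated activations
    cov = (z.t() @ z) / x.size(0)                     # batch covariance of z
    off_diag = cov - torch.diag(torch.diag(cov))      # keep only cross-correlations
    R = R - lr * off_diag @ R                         # push cross-correlations toward zero
    return R, z

d = 16
R = torch.eye(d)
for _ in range(200):                                  # correlated toy inputs
    base = torch.randn(128, d)
    x = base + 0.8 * base.roll(1, dims=1)
    R, z = decorrelation_step(R, x, lr=1e-2)

residual = torch.abs(torch.corrcoef(z.t()) - torch.eye(d)).mean()
print(residual)                                       # residual correlation, shrinks as R adapts
```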
https://arxiv.org/abs/2501.19133