Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs (Principal Mask Proposals), which decompose images into semantically meaningful masks based on their feature representations. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.
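The core loop of fitting class prototypes to mask features with EM can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the hard-assignment E-step, cosine similarity, and momentum M-step are assumptions about plausible details.

```python
import numpy as np

def primaps_em_step(mask_feats, prototypes, momentum=0.99):
    """One stochastic EM step fitting class prototypes to mask features.

    mask_feats: (M, D) mean feature per mask proposal
    prototypes: (K, D) current class prototypes
    """
    # Normalize so dot products are cosine similarities.
    f = mask_feats / np.linalg.norm(mask_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)

    # E-step: hard-assign each mask to its most similar prototype.
    assign = np.argmax(f @ p.T, axis=1)

    # M-step: momentum update of each assigned prototype toward the
    # mean feature of the masks assigned to it.
    new_protos = prototypes.copy()
    for k in np.unique(assign):
        target = f[assign == k].mean(axis=0)
        new_protos[k] = momentum * prototypes[k] + (1 - momentum) * target
    return assign, new_protos
```

Iterating such steps over a stream of images would gradually align the prototypes with the global categories of the corpus.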
https://arxiv.org/abs/2404.16818
Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using SPARO, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual SPARO concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of SPARO's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.
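The read-out idea (one slot per single attention head, each with its own learned query) can be sketched as follows; the parameter names and the shared key/value projections are assumptions, not the actual SPARO implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparo_readout(tokens, queries, W_k, W_v):
    """Partition the encoding into separately-attended slots.

    tokens:  (T, D) backbone token encodings
    queries: (S, D) one learned query per slot
    W_k, W_v: (D, D) key/value projections (assumed shared across slots)
    Returns (S, D): each slot is produced by a single attention head.
    """
    K = tokens @ W_k
    V = tokens @ W_v
    scores = queries @ K.T / np.sqrt(K.shape[1])  # (S, T)
    attn = softmax(scores, axis=1)                # each slot attends separately
    return attn @ V
```

Because each slot attends over the tokens independently, individual slots can later be selected or intervened on, which is the ability the abstract exploits.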
https://arxiv.org/abs/2404.15721
Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification, relying heavily on partial prior knowledge during decision-making, which results in suboptimal performance. Although multi-model fusion (MMF) can mitigate some of these issues, redundancy in learned representations may limit improvements. To this end, we propose an adversarial complementary representation learning (ACoRL) framework that enables newly trained models to avoid previously acquired knowledge, allowing each individual component model to learn maximally distinct, complementary representations. We provide three detailed explanations of why this works, and experimental results demonstrate that our method improves performance more efficiently than traditional MMF. Furthermore, attribution analysis validates that the model trained under ACoRL acquires more complementary knowledge, highlighting the efficacy of our approach in enhancing efficiency and robustness across tasks.
https://arxiv.org/abs/2404.15704
The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Mask AutoEncoder (MAE) feature pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to the large spatial sizes and much higher dimensionality of 3D medical images, the lack of hierarchical design in MAE may hinder the performance of downstream tasks. In this paper, we propose a novel \textit{Mask in Mask (MiM)} pre-training framework for 3D medical images, which aims to advance MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are then reconstructed simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent-level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enhance hierarchical representation learning efficiently during pre-training. MiM was pre-trained on a large corpus of available 3D volumetric images, \textit{i.e.,} Computed Tomography (CT) images containing various body parts. Extensive experiments on thirteen public datasets demonstrate the superiority of MiM over other SSL methods in organ/lesion/tumor segmentation and disease classification. We further scale up MiM to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance the performance of downstream tasks. This improvement suggests that the research community should pay more attention to the scale of the pre-training dataset when building healthcare foundation models for 3D medical images.
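The multiple levels of masking granularity can be illustrated with a small sketch; the block sizes, the masking ratio, and the independent sampling per level below are assumptions, not the paper's exact scheme.

```python
import numpy as np

def multi_granularity_masks(vol_shape, block_sizes, ratio, seed=0):
    """Boolean masks for a 3D volume at several granularities.

    vol_shape:   volume size, e.g. (32, 32, 32)
    block_sizes: masking granularity per level, coarse to fine, e.g. [8, 4]
    ratio:       fraction of blocks masked at each level
    Returns one boolean mask per level (True = masked voxel).
    """
    rng = np.random.default_rng(seed)
    masks = []
    for b in block_sizes:
        grid = tuple(s // b for s in vol_shape)
        n_blocks = int(np.prod(grid))
        flat = np.zeros(n_blocks, dtype=bool)
        idx = rng.choice(n_blocks, int(ratio * n_blocks), replace=False)
        flat[idx] = True
        m = flat.reshape(grid)
        for axis in range(3):  # upsample block mask to voxel resolution
            m = np.repeat(m, b, axis=axis)
        masks.append(m)
    return masks
```

Reconstruction losses would then be computed at each level simultaneously, with coarse levels capturing anatomy and fine levels capturing detail.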
https://arxiv.org/abs/2404.15580
Pre-training GNNs to extract transferable knowledge and apply it to downstream tasks has become the de facto standard of graph representation learning. Recent works have focused on designing self-supervised pre-training tasks to extract useful and universal transferable knowledge from large-scale unlabeled data. However, they face an inevitable question: traditional pre-training strategies, which aim at extracting information useful for the pre-training tasks, may not extract all the information useful for the downstream task. In this paper, we reexamine the pre-training process within traditional pre-training and fine-tuning frameworks from the perspective of the Information Bottleneck (IB) and confirm that the forgetting phenomenon in the pre-training phase may cause detrimental effects on downstream tasks. Therefore, we propose a novel \underline{D}elayed \underline{B}ottlenecking \underline{P}re-training (DBP) framework, which maintains as much mutual information as possible between latent representations and training data during the pre-training phase by suppressing the compression operation, and delays the compression operation to the fine-tuning phase so that the compression can be guided by labeled fine-tuning data and downstream tasks. To achieve this, we design two information control objectives that can be directly optimized and further integrate them into the actual model design. Extensive experiments on both chemistry and biology domains demonstrate the effectiveness of DBP.
https://arxiv.org/abs/2404.14941
Anomaly detection in real-world scenarios poses challenges due to dynamic and often unknown anomaly distributions, requiring robust methods that operate under an open-world assumption. This challenge is exacerbated in practical settings, where models are employed by private organizations, precluding data sharing due to privacy and competitive concerns. Despite potential benefits, the sharing of anomaly information across organizations is restricted. This paper addresses the question of enhancing outlier detection within individual organizations without compromising data confidentiality. We propose a novel method leveraging representation learning and federated learning techniques to improve the detection of unknown anomalies. Specifically, our approach utilizes latent representations obtained from client-owned autoencoders to refine the decision boundary of inliers. Notably, only model parameters are shared between organizations, preserving data privacy. The efficacy of our proposed method is evaluated on two standard financial tabular datasets and an image dataset for anomaly detection in a distributed setting. The results demonstrate a strong improvement in the classification of unknown outliers during the inference phase for each organization's model.
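Sharing only model parameters between organizations is the classic federated-averaging pattern; here is a minimal sketch assuming FedAvg-style sample-size-weighted aggregation, which the paper's actual protocol may refine.

```python
import numpy as np

def fed_avg(client_params, client_sizes):
    """Aggregate client models by a sample-size-weighted parameter average.

    client_params: list of dicts {param_name: ndarray}, one per organization
    client_sizes:  number of local samples per organization
    Only parameters are exchanged; raw data never leaves a client.
    """
    total = sum(client_sizes)
    return {
        name: sum((n / total) * params[name]
                  for params, n in zip(client_params, client_sizes))
        for name in client_params[0]
    }
```

Each organization would train its autoencoder locally, send only the weights, and receive the aggregate back to refine its inlier decision boundary.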
https://arxiv.org/abs/2404.14933
Fine-tuning pre-trained protein language models (PLMs) has emerged as a prominent strategy for enhancing downstream prediction tasks, often outperforming traditional supervised learning approaches. Parameter-efficient fine-tuning, a powerful technique widely applied in natural language processing, could potentially enhance the performance of PLMs. However, the direct transfer to life science tasks is non-trivial due to the different training strategies and data forms. To address this gap, we introduce SES-Adapter, a simple, efficient, and scalable adapter method for enhancing the representation learning of PLMs. SES-Adapter incorporates PLM embeddings with structural sequence embeddings to create structure-aware representations. We show that the proposed method is compatible with different PLM architectures and across diverse tasks. Extensive evaluations are conducted on 2 types of folding structures with notable quality differences, 9 state-of-the-art baselines, and 9 benchmark datasets across distinct downstream tasks. Results show that, compared to vanilla PLMs, SES-Adapter improves downstream task performance by a maximum of 11% and an average of 3%, while significantly accelerating training speed by a maximum of 1034% and an average of 362%; the convergence rate is also improved by approximately 2 times. Moreover, positive optimization is observed even with low-quality predicted structures. The source code for SES-Adapter is available at this https URL.
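One plausible reading of "incorporates PLM embeddings with structural sequence embeddings" is a residual cross-attention fusion. The sketch below is a guess at such a mechanism with hypothetical names, not the actual SES-Adapter architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ses_fuse(plm_emb, struct_emb):
    """Residual cross-attention from PLM embeddings to structural embeddings.

    plm_emb:    (L, D) per-residue embeddings from the PLM
    struct_emb: (L, D) embeddings of the structural sequence
    """
    scores = plm_emb @ struct_emb.T / np.sqrt(plm_emb.shape[1])  # (L, L)
    attn = softmax(scores, axis=1)
    return plm_emb + attn @ struct_emb  # structure-aware representation
```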
https://arxiv.org/abs/2404.14850
Discovering an informative, or agent-centric, state representation that encodes only the relevant information while discarding the irrelevant is a key challenge towards scaling reinforcement learning algorithms and efficiently applying them to downstream tasks. Prior works studied this problem in high-dimensional Markovian environments, when the current observation may be a complex object but is sufficient to decode the informative state. In this work, we consider the problem of discovering the agent-centric state in the more challenging high-dimensional non-Markovian setting, when the state can be decoded from a sequence of past observations. We establish that generalized inverse models can be adapted for learning agent-centric state representations for this task. Our results include asymptotic theory in the deterministic dynamics setting as well as counter-examples for alternative intuitive algorithms. We complement these findings with a thorough empirical study on the agent-centric state discovery abilities of the different alternatives we put forward. Particularly notable is our analysis of past actions, where we show that these can be a double-edged sword: making the algorithms more successful when used correctly and causing dramatic failure when used incorrectly.
https://arxiv.org/abs/2404.14552
Imaging sites around the world generate growing amounts of medical scan data with ever more versatile and affordable technology. Large-scale studies acquire MRI for tens of thousands of participants, together with metadata ranging from lifestyle questionnaires to biochemical assays, genetic analyses and more. These large datasets encode substantial information about human health and hold considerable potential for machine learning training and analysis. This chapter examines ongoing large-scale studies and the challenge of distribution shifts between them. Transfer learning for overcoming such shifts is discussed, together with federated learning for secure access to distributed training data held at multiple institutions. Finally, representation learning is reviewed as a methodology for encoding embeddings that express abstract relationships in multi-modal input formats.
https://arxiv.org/abs/2404.14326
Trajectory modeling refers to characterizing human movement behavior, serving as a pivotal step in understanding mobility patterns. Nevertheless, existing studies typically ignore the confounding effects of geospatial context, leading to the acquisition of spurious correlations and limited generalization capabilities. To bridge this gap, we initially formulate a Structural Causal Model (SCM) to decipher the trajectory representation learning process from a causal perspective. Building upon the SCM, we further present a Trajectory modeling framework (TrajCL) based on Causal Learning, which leverages the backdoor adjustment theory as an intervention tool to eliminate the spurious correlations between geospatial context and trajectories. Extensive experiments on two real-world datasets verify that TrajCL markedly enhances performance in trajectory classification tasks while showcasing superior generalization and interpretability.
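The backdoor adjustment the framework builds on is the standard causal formula P(Y | do(X)) = sum_c P(Y | X, c) P(c), with geospatial context playing the role of the confounder c. A minimal discrete-case sketch:

```python
import numpy as np

def backdoor_adjust(p_y_given_x_c, p_c):
    """Backdoor adjustment: P(Y | do(X)) = sum_c P(Y | X, c) P(c).

    p_y_given_x_c: (X, C, Y) conditional probability table
    p_c:           (C,) marginal over the confounder (geospatial context)
    """
    return np.einsum('xcy,c->xy', p_y_given_x_c, p_c)
```

Averaging over the context strata, rather than conditioning on whichever context co-occurred with each trajectory, is what removes the spurious context-trajectory correlation.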
https://arxiv.org/abs/2404.14073
Most works studying representation learning focus only on classification and neglect regression. Yet, the learning objectives, and therefore the representation topologies, of the two tasks are fundamentally different: classification targets class separation, leading to disconnected representations, whereas regression requires ordinality with respect to the target, leading to continuous representations. We thus wonder how the effectiveness of a regression representation is influenced by its topology, with evaluation based on the Information Bottleneck (IB) principle. The IB principle is an important framework that provides guidance for learning effective representations. We establish two connections between it and the topology of regression representations. The first connection reveals that a lower intrinsic dimension of the feature space implies a reduced complexity of the representation Z. This complexity can be quantified as the conditional entropy of Z given the target space Y and serves as an upper bound on the generalization error. The second connection suggests that learning a feature space that is topologically similar to the target space will better align with the IB principle. Based on these two connections, we introduce PH-Reg, a regularizer specific to regression that matches the intrinsic dimension and topology of the feature space with the target space. Experiments on synthetic and real-world regression tasks demonstrate the benefits of PH-Reg.
https://arxiv.org/abs/2404.13904
Recent techniques on implicit geometry representation learning and neural rendering have shown promising results for 3D clothed human reconstruction from sparse video inputs. However, it is still challenging to reconstruct detailed surface geometry and even more difficult to synthesize photorealistic novel views with animated human poses. In this work, we introduce PGAHum, a prior-guided geometry and appearance learning framework for high-fidelity animatable human reconstruction. We thoroughly exploit 3D human priors in three key modules of PGAHum to achieve high-quality geometry reconstruction with intricate details and photorealistic view synthesis on unseen poses. First, a prior-based implicit geometry representation of 3D human, which contains a delta SDF predicted by a tri-plane network and a base SDF derived from the prior SMPL model, is proposed to model the surface details and the body shape in a disentangled manner. Second, we introduce a novel prior-guided sampling strategy that fully leverages the prior information of the human pose and body to sample the query points within or near the body surface. By avoiding unnecessary learning in the empty 3D space, the neural rendering can recover more appearance details. Last, we propose a novel iterative backward deformation strategy to progressively find the correspondence for the query point in observation space. A skinning weights prediction model is learned based on the prior provided by the SMPL model to achieve the iterative backward LBS deformation. Extensive quantitative and qualitative comparisons on various datasets are conducted and the results demonstrate the superiority of our framework. Ablation studies also verify the effectiveness of each scheme for geometry and appearance learning.
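The disentangled geometry representation of the first module can be illustrated directly: the full SDF is the SMPL-derived base SDF plus a learned delta. The toy stand-ins below replace the SMPL SDF and the tri-plane network with simple functions for illustration only.

```python
import numpy as np

def composed_sdf(points, base_sdf, delta_net):
    """Disentangled SDF: body shape from the prior plus learned surface detail."""
    return base_sdf(points) + delta_net(points)

# Toy stand-ins: a unit sphere as the "body prior", a small bump as detail.
base = lambda x: np.linalg.norm(x, axis=-1) - 1.0
delta = lambda x: 0.05 * np.sin(x[..., 0])
```

Keeping the base and delta separate is what lets the prior constrain the overall body shape while the network is free to model fine wrinkles and garment detail.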
https://arxiv.org/abs/2404.13862
Conditional independence (CI) constraints are critical for defining and evaluating fairness in machine learning, as well as for learning unconfounded or causal representations. Traditional methods for ensuring fairness either blindly learn invariant features with respect to a protected variable (e.g., race when classifying sex from face images) or enforce CI relative to the protected attribute only on the model output (e.g., the sex label). Neither of these methods is effective in enforcing CI in high-dimensional feature spaces. In this paper, we focus on a nascent approach characterizing the CI constraint in terms of two Jensen-Shannon divergence terms, and we extend it to high-dimensional feature spaces using a novel dynamic sampling strategy. In doing so, we introduce a new training paradigm that can be applied to any encoder architecture. We are able to enforce conditional independence of the diffusion autoencoder latent representation with respect to any protected attribute under the equalized odds constraint and show that this approach enables causal image generation with controllable latent spaces. Our experimental results demonstrate that our approach can achieve high accuracy on downstream tasks while upholding equality of odds.
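The Jensen-Shannon divergence underlying the CI characterization is a standard quantity; the sketch below computes it for discrete distributions (the paper's two-term formulation and dynamic sampling strategy are not shown).

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):  # Kullback-Leibler divergence KL(a || b)
        return np.sum(a * np.log(a / b))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

JS divergence is symmetric and bounded by log 2, which makes it a convenient building block for divergence-based independence penalties.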
https://arxiv.org/abs/2404.13798
Graph neural networks (GNNs) have revolutionized the field of machine learning on non-Euclidean data such as graphs and networks. GNNs effectively implement node representation learning through neighborhood aggregation and achieve impressive results in many graph-related tasks. However, most neighborhood aggregation approaches are summation-based, which can be problematic as they may not be sufficiently expressive to encode informative graph structures. Furthermore, though the graph pooling module is also of vital importance for graph learning, especially for the task of graph classification, research on graph down-sampling mechanisms is rather limited. To address the above challenges, we propose a concatenation-based graph convolution mechanism that injectively updates node representations to maximize the discriminative power in distinguishing non-isomorphic subgraphs. In addition, we design a novel graph pooling module, called WL-SortPool, to learn important subgraph patterns in a deep-learning manner. WL-SortPool layer-wise sorts node representations (i.e. continuous WL colors) to separately learn the relative importance of subtrees with different depths for the purpose of classification, thus better characterizing the complex graph topology and rich information encoded in the graph. We propose a novel Subgraph Pattern GNN (SPGNN) architecture that incorporates these enhancements. We test the proposed SPGNN architecture on many graph classification benchmarks. Experimental results show that our method can achieve highly competitive results with state-of-the-art graph kernels and other GNN approaches.
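A simplified WL-SortPool-style read-out, sorting each layer's node representations separately and keeping the top-k per depth, can be sketched as follows; sorting by the last feature channel and zero-padding small graphs are illustrative choices, not necessarily the paper's.

```python
import numpy as np

def wl_sortpool(layer_feats, k):
    """Pool each GNN layer separately into a fixed-size vector.

    layer_feats: list of (N, D) node representations, one per layer
                 (continuous WL colors at increasing depths)
    k:           number of nodes kept per layer
    """
    pooled = []
    for H in layer_feats:
        order = np.argsort(-H[:, -1])  # sort nodes by last channel, descending
        top = H[order[:k]]
        if top.shape[0] < k:           # zero-pad small graphs
            pad = np.zeros((k - top.shape[0], H.shape[1]))
            top = np.vstack([top, pad])
        pooled.append(top.ravel())
    return np.concatenate(pooled)      # depths pooled separately, then joined
```

Pooling each depth separately is what lets subtrees of different depths contribute their own importance signal to the downstream classifier.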
https://arxiv.org/abs/2404.13655
Machine learning models have made incredible progress, but they still struggle when applied to examples from unseen domains. This study focuses on a specific problem of domain generalization, where a model is trained on one source domain and tested on multiple target domains that are unseen during training. We propose IMO: Invariant features Masks for Out-of-Distribution text classification, to achieve OOD generalization by learning invariant features. During training, IMO learns sparse mask layers to remove features irrelevant for prediction, while the remaining features remain invariant. Additionally, IMO has an attention module at the token level to focus on tokens that are useful for prediction. Our comprehensive experiments show that IMO substantially outperforms strong baselines in terms of various evaluation metrics and settings.
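The sparse feature mask can be sketched as follows; the relaxed sigmoid mask pushed toward sparsity by an L1 penalty is a common construction and an assumption here, not necessarily IMO's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_features(feats, mask_logits, threshold=0.0):
    """Apply a hard feature mask at inference time.

    feats:       (N, D) input features
    mask_logits: (D,) learned logits; logits above the threshold keep a feature
    """
    keep = mask_logits > threshold
    return feats * keep, keep

def l1_sparsity_penalty(mask_logits):
    """Training-time L1 penalty on the relaxed (sigmoid) mask."""
    return sigmoid(mask_logits).sum()
```

During training the soft sigmoid mask is differentiable and the penalty drives irrelevant feature dimensions toward zero, leaving only the invariant ones active.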
https://arxiv.org/abs/2404.13504
The deep learning revolution has strongly impacted low-level image processing tasks such as style/domain transfer, enhancement/restoration, and visual quality assessments. Despite often being treated separately, the aforementioned tasks share a common theme of understanding, editing, or enhancing the appearance of input images without modifying the underlying content. We leverage this observation to develop a novel disentangled representation learning method that decomposes inputs into content and appearance features. The model is trained in a self-supervised manner and we use the learned features to develop a new quality prediction model named DisQUE. We demonstrate through extensive evaluations that DisQUE achieves state-of-the-art accuracy across quality prediction tasks and distortion types. Moreover, we demonstrate that the same features may also be used for image processing tasks such as HDR tone mapping, where the desired output characteristics may be tuned using example input-output pairs.
https://arxiv.org/abs/2404.13484
Ontology matching is defined as finding a relationship or correspondence between two or more entities in two or more ontologies. To solve the interoperability problem of domain ontologies, semantically similar entities in these ontologies must be found and aligned before merging them. GraphMatcher, developed in this study, is an ontology matching system using a graph attention approach to compute a higher-level representation of a class together with its surrounding terms. GraphMatcher has obtained remarkable results in the Ontology Alignment Evaluation Initiative (OAEI) 2022 conference track. Its code is available at \url{this https URL}.
https://arxiv.org/abs/2404.14450
Decoding visual information from human brain activity has seen remarkable advancements in recent research. However, due to the significant variability in cortical parcellation and cognition patterns across subjects, current approaches personalize deep models for each subject, constraining the practicality of this technology in real-world contexts. To tackle these challenges, we introduce Wills Aligner, a robust multi-subject brain representation learner. Wills Aligner initially aligns different subjects' brains at the anatomical level. Subsequently, it incorporates a mixture of brain experts to learn individual cognition patterns. Additionally, it decouples the multi-subject learning task into a two-stage training, propelling the deep model and its plugin network to learn inter-subject commonality knowledge and various cognition patterns, respectively. Wills Aligner enables us to overcome anatomical differences and to efficiently leverage a single model for multi-subject brain representation learning. We meticulously evaluate the performance of our approach across coarse-grained and fine-grained visual decoding tasks. The experimental results demonstrate that Wills Aligner achieves state-of-the-art performance.
https://arxiv.org/abs/2404.13282
Representation learning from Gigapixel Whole Slide Images (WSI) poses a significant challenge in computational pathology due to the complicated nature of tissue structures and the scarcity of labeled data. Multiple-instance learning (MIL) methods have addressed this challenge by leveraging image patches to classify slides, with feature encoders pretrained using Self-Supervised Learning (SSL) approaches. The performance of both SSL and MIL methods relies on the architecture of the feature encoder. This paper proposes leveraging the Vision Mamba (Vim) architecture, inspired by state space models, within the DINO framework for representation learning in computational pathology. We evaluate the performance of Vim against Vision Transformers (ViT) on the Camelyon16 dataset for both patch-level and slide-level classification. Our findings highlight Vim's enhanced performance compared to ViT, particularly at smaller scales, where Vim achieves an 8.21 increase in ROC AUC for models of similar size. An explainability analysis further highlights Vim's capabilities, revealing that Vim, unlike ViT, uniquely emulates the pathologist's workflow. This alignment with human expert analysis highlights Vim's potential in practical diagnostic settings and contributes significantly to developing effective representation-learning algorithms in computational pathology. We release the code and pretrained weights at \url{this https URL}.
https://arxiv.org/abs/2404.13222
Sequential recommendation is dedicated to offering items of interest to users based on their historical behaviors. The attribute-opinion pairs, expressed by users in their reviews of items, provide the potential to capture user preferences and item characteristics at a fine-grained level. To this end, we propose a novel framework, FineRec, that explores the attribute-opinion pairs of reviews to finely handle sequential recommendation. Specifically, we utilize a large language model to extract attribute-opinion pairs from reviews. For each attribute, a unique attribute-specific user-opinion-item graph is created, where the corresponding opinions serve as edges linking heterogeneous user and item nodes. To tackle the diversity of opinions, we devise a diversity-aware convolution operation to aggregate information within the graphs, enabling attribute-specific user and item representation learning. Ultimately, we present an interaction-driven fusion mechanism to integrate attribute-specific user/item representations across all attributes for generating recommendations. Extensive experiments conducted on several real-world datasets demonstrate the superiority of FineRec over existing state-of-the-art methods. Further analysis also verifies the effectiveness of our fine-grained manner of handling the task.
https://arxiv.org/abs/2404.12975