Knowledge graphs (KGs), which store an extensive number of relational facts (head, relation, tail), serve various applications. While many downstream tasks rely heavily on expressive modeling and predictive embedding of KGs, most current KG representation learning methods, in which each entity is embedded as a vector in Euclidean space and each relation as a transformation, follow an entity-ranking protocol. On the one hand, such an embedding design cannot capture many-to-many relations. On the other hand, in many retrieval cases, users wish to get an exact set of answers without any ranking, especially when the results are expected to be precise, e.g., which genes cause an illness. Such scenarios are commonly referred to as "set retrieval". This work presents a pioneering study of the KG set retrieval problem. We show that set retrieval depends heavily on expressive modeling of many-to-many relations, and propose a new KG embedding model, SpherE, to address this problem. SpherE builds on rotational embedding methods, but embeds each entity as a sphere instead of a vector. While inheriting the high interpretability of rotation-based models, SpherE more expressively models one-to-many, many-to-one, and many-to-many relations. Through extensive experiments, we show that SpherE effectively addresses the set retrieval problem while retaining good predictive ability to infer missing facts. The code is available at this https URL.
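The sphere idea can be made concrete with a toy scoring function. The sketch below is hypothetical (names and the exact score form are illustrative, not the paper's definition) and assumes a RotatE-style element-wise complex rotation for relations: a triple is plausible when the rotated head sphere overlaps the tail sphere, so one (head, relation) pair can match a whole set of tails.

```python
import numpy as np

def rotate(center, phases):
    # Element-wise complex rotation of a sphere's center (RotatE-style relation).
    return center * np.exp(1j * phases)

def sphere_score(h_center, h_radius, phases, t_center, t_radius):
    # Distance between the rotated head center and the tail center, minus the
    # sum of the radii: 0 means the spheres touch or overlap, i.e., a plausible fact.
    d = np.linalg.norm(rotate(h_center, phases) - t_center)
    return max(0.0, d - (h_radius + t_radius))
```

Because every tail whose sphere intersects the rotated head sphere scores 0, such a model naturally returns a set of answers rather than a ranking.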
https://arxiv.org/abs/2404.19130
Talking head generation has recently attracted considerable attention due to its widespread application prospects, especially for digital avatars and 3D animation design. Inspired by this practical demand, several works have explored Neural Radiance Fields (NeRF) to synthesize talking heads. However, these NeRF-based methods face two challenges: (1) difficulty in generating style-controllable talking heads; (2) displacement artifacts around the neck in rendered images. To overcome these two challenges, we propose a novel generative paradigm, the \textit{Embedded Representation Learning Network} (ERLNet), with two learning stages. First, the \textit{audio-driven FLAME} (ADF) module is constructed to produce facial expression and head pose sequences synchronized with content audio and style video. Second, given the sequence deduced by the ADF, a novel \textit{dual-branch fusion NeRF} (DBF-NeRF) explores these contents to render the final images. Extensive empirical studies demonstrate that the collaboration of these two stages enables our method to render a more realistic talking head than existing algorithms.
https://arxiv.org/abs/2404.19038
Understanding the severity of conditions shown in images is crucial in medical diagnosis, serving as a key guide for clinical assessment and treatment, as well as for evaluating longitudinal progression. This paper proposes ConPrO: a novel representation learning method for severity assessment in medical images using Contrastive-learning-integrated Preference Optimization. Different from conventional contrastive learning methods that maximize the distance between classes, ConPrO injects into the latent vector the distance preference knowledge between various severity classes and the normal class. We systematically examine the key components of our framework to illuminate how contrastive prediction tasks acquire valuable representations. We show that our representation learning framework offers valuable severity ordering in the feature space while outperforming previous state-of-the-art methods on classification tasks. We achieve a 6% and 20% relative improvement compared to a supervised and a self-supervised baseline, respectively. In addition, we provide a discussion of severity indicators and related applications of preference comparison in the medical domain.
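The preference component can be illustrated with a toy hinge loss (a hypothetical simplification; ConPrO's actual objective combines this idea with contrastive learning): representations of more severe classes are pushed farther from the normal class than milder ones by at least a margin.

```python
import numpy as np

def severity_preference_loss(z_normal, z_mild, z_severe, margin=1.0):
    # Preference: distance(severe, normal) should exceed distance(mild, normal)
    # by at least `margin`, injecting a severity ordering into the latent space.
    d_mild = np.linalg.norm(z_mild - z_normal)
    d_severe = np.linalg.norm(z_severe - z_normal)
    return max(0.0, d_mild - d_severe + margin)
```

The loss is zero when the ordering already holds with the required margin, and grows as the ordering is violated.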
https://arxiv.org/abs/2404.18831
Text-rich graphs, which exhibit rich textual information on nodes and edges, are prevalent across a wide range of real-world business applications. Large Language Models (LLMs) have demonstrated remarkable abilities in understanding text, which also introduces the potential for more expressive modeling of text-rich graphs. Despite these capabilities, efficiently applying LLMs to representation learning on graphs presents significant challenges. Recently, parameter-efficient fine-tuning methods for LLMs have enabled efficient generalization to new tasks with minimal time and memory consumption. Inspired by this, we introduce Graph-aware Parameter-Efficient Fine-Tuning (GPEFT), a novel approach for efficient graph representation learning with LLMs on text-rich graphs. Specifically, we utilize a graph neural network (GNN) to encode structural information from neighboring nodes into a graph prompt. This prompt is then inserted at the beginning of the text sequence. To improve the quality of graph prompts, we pre-train the GNN to assist the frozen LLM in predicting the next token in the node text. Compared with existing joint GNN-LM methods, our method directly generates node embeddings from large language models at an affordable fine-tuning cost. We validate our approach through comprehensive experiments conducted on 8 different text-rich graphs, observing an average improvement of 2% in hit@1 and Mean Reciprocal Rank (MRR) in link prediction evaluations. Our results demonstrate the efficacy and efficiency of our model, showing that it can be smoothly integrated with various large language models, including OPT, LLaMA, and Falcon.
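The core mechanics, aggregating neighbors with a GNN and prepending the result as a soft prompt, can be sketched as follows. A single mean-aggregation layer stands in for the paper's GNN, and all names and shapes are illustrative:

```python
import numpy as np

def graph_prompt(neighbor_embs, W):
    # Minimal GNN stand-in: mean-aggregate neighbor embeddings, then project.
    return np.tanh(neighbor_embs.mean(axis=0) @ W)

def prepend_prompt(token_embs, prompt_vec):
    # Insert the graph prompt at position 0 of the LLM's input embedding
    # sequence, so the frozen LLM conditions on structural context.
    return np.vstack([prompt_vec[None, :], token_embs])
```

In the actual method the prompt lives in the LLM's embedding space and the GNN is pre-trained on next-token prediction; here the point is only the data flow.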
https://arxiv.org/abs/2404.18271
Continual learning (CL) remains one of the long-standing challenges for deep neural networks due to catastrophic forgetting of previously acquired knowledge. Although rehearsal-based approaches have been fairly successful in mitigating catastrophic forgetting, they suffer from overfitting on buffered samples and prior information loss, hindering generalization under low-buffer regimes. Inspired by how humans learn using strong inductive biases, we propose IMEX-Reg to improve the generalization performance of experience rehearsal in CL under low buffer regimes. Specifically, we employ a two-pronged implicit-explicit regularization approach using contrastive representation learning (CRL) and consistency regularization. To further leverage the global relationship between representations learned using CRL, we propose a regularization strategy to guide the classifier toward the activation correlations in the unit hypersphere of the CRL. Our results show that IMEX-Reg significantly improves generalization performance and outperforms rehearsal-based approaches in several CL scenarios. It is also robust to natural and adversarial corruptions with less task-recency bias. Additionally, we provide theoretical insights to support our design decisions further.
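A toy version of the two-pronged objective can be written down directly (a hypothetical simplification; the paper's implicit term is a full contrastive loss and its explicit term a consistency regularizer over buffered samples):

```python
import numpy as np

def imex_style_regularizer(feat_a, feat_b, logits_current, logits_buffered, alpha=1.0):
    # Implicit prong: pull two augmented views' L2-normalized features together.
    za = feat_a / np.linalg.norm(feat_a)
    zb = feat_b / np.linalg.norm(feat_b)
    implicit = 1.0 - float(za @ zb)
    # Explicit prong: keep current logits consistent with those stored when the
    # sample entered the rehearsal buffer (soft-target MSE).
    explicit = float(((logits_current - logits_buffered) ** 2).mean())
    return implicit + alpha * explicit
```

The regularizer vanishes when the two views agree and the logits have not drifted from their buffered values.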
https://arxiv.org/abs/2404.18161
Graph Neural Networks (GNNs) have demonstrated state-of-the-art performance in various graph representation learning tasks. Recently, studies revealed their vulnerability to adversarial attacks. In this work, we theoretically define the concept of expected robustness in the context of attributed graphs and relate it to the classical definition of adversarial robustness in the graph representation learning literature. Our definition allows us to derive an upper bound on the expected robustness of Graph Convolutional Networks (GCNs) and Graph Isomorphism Networks subject to node feature attacks. Building on these findings, we connect the expected robustness of GNNs to the orthonormality of their weight matrices and consequently propose an attack-independent, more robust variant of the GCN, called Graph Convolutional Orthonormal Robust Networks (GCORNs). We further introduce a probabilistic method to estimate the expected robustness, which allows us to evaluate the effectiveness of GCORN on several real-world datasets. Experiments show that GCORN outperforms available defense methods. Our code is publicly available at: \href{this https URL}{this https URL}.
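The orthonormality connection is easy to make concrete: a soft regularizer that drives each weight matrix toward satisfying W^T W = I keeps layers roughly norm-preserving, so a node-feature perturbation cannot be amplified as it propagates. This is a minimal sketch of the idea; GCORN's exact construction differs.

```python
import numpy as np

def orthonormal_penalty(W):
    # Frobenius distance of W^T W from the identity; adding lambda * penalty
    # to the training loss pushes W toward orthonormal columns.
    return float(np.linalg.norm(W.T @ W - np.eye(W.shape[1])))
```

An exactly orthonormal matrix incurs zero penalty; any scaling away from orthonormality is penalized.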
https://arxiv.org/abs/2404.17947
Modeling the dynamics of interacting entities using an evolving graph is an essential problem in fields such as financial networks and e-commerce. Traditional approaches focus primarily on pairwise interactions, limiting their ability to capture the complexity of real-world interactions involving multiple entities and their intricate relationship structures. This work addresses the problem of forecasting higher-order interaction events in multi-relational recursive hypergraphs. This is done using a dynamic graph representation learning framework that can capture complex relationships involving multiple entities. The proposed model, \textit{Relational Recursive Hyperedge Temporal Point Process} (RRHyperTPP), uses an encoder that learns a dynamic node representation based on historical interaction patterns, and then a hyperedge link prediction-based decoder to model the event's occurrence. These learned representations are then used for downstream tasks involving forecasting the type and time of interactions. The main challenge in learning from hyperedge events is that the number of possible hyperedges grows exponentially with the number of nodes in the network. This makes the computation of the negative log-likelihood of the temporal point process expensive, as the calculation of the survival function requires a summation over all possible hyperedges. In our work, we use noise contrastive estimation to learn the parameters of our model, and we experimentally show that our model performs better than previous state-of-the-art methods for interaction forecasting.
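The NCE workaround can be sketched as a binary classification between observed hyperedge events and sampled noise hyperedges, which sidesteps the exponential sum in the survival function. This is an illustrative form only; the paper's estimator is defined on the temporal point process itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(scores_observed, scores_noise):
    # Instead of normalizing over the exponentially many candidate hyperedges,
    # train the model to tell observed events (label 1) from sampled noise
    # hyperedges (label 0) using only their unnormalized scores.
    pos = -np.log(sigmoid(scores_observed))
    neg = -np.log(1.0 - sigmoid(scores_noise))
    return float(pos.mean() + neg.mean())
```

The loss is small when observed events score high and noise hyperedges score low, and large when the model confuses the two.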
https://arxiv.org/abs/2404.17943
To deal with heterogeneity resulting from label distribution skew and data scarcity in distributed machine learning scenarios, this paper proposes a novel Personalized Federated Learning (PFL) algorithm, named Federated Contrastive Representation Learning (FedCRL). FedCRL introduces contrastive representation learning (CRL) on shared representations to facilitate knowledge acquisition of clients. Specifically, both local model parameters and averaged values of local representations are considered as shareable information to the server, both of which are then aggregated globally. CRL is applied between local representations and global representations to regularize personalized training by drawing similar representations closer and separating dissimilar ones, thereby enhancing local models with external knowledge and avoiding being harmed by label distribution skew. Additionally, FedCRL adopts local aggregation between each local model and the global model to tackle data scarcity. A loss-wise weighting mechanism is introduced to guide the local aggregation using each local model's contrastive loss to coordinate the global model involvement in each client, thus helping clients with scarce data. Our simulations demonstrate FedCRL's effectiveness in mitigating label heterogeneity by achieving accuracy improvements over existing methods on datasets with varying degrees of label heterogeneity.
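The loss-wise weighting can be sketched as follows. The weighting function here is hypothetical (the abstract only specifies that each client's contrastive loss coordinates how strongly the global model is mixed in): a client with a higher contrastive loss, typically one with scarce data, leans more on the global model.

```python
import numpy as np

def local_aggregate(local_params, global_params, contrastive_loss, temperature=1.0):
    # Map the client's contrastive loss to a mixing weight in (0, 1) that grows
    # with the loss; a struggling client pulls more from the global model.
    w = 1.0 / (1.0 + np.exp(-contrastive_loss / temperature))
    return {k: (1 - w) * local_params[k] + w * global_params[k] for k in local_params}
```

With a loss of zero the client mixes the two models equally; as the loss grows, the aggregate approaches the global model.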
https://arxiv.org/abs/2404.17916
The representation of events in text plays a significant role in various NLP tasks. Recent research demonstrates that contrastive learning has the ability to improve event comprehension capabilities of Pre-trained Language Models (PLMs) and enhance the performance of event representation learning. However, the efficacy of event representation learning based on contrastive learning and PLMs is limited by the short length of event texts. The length of event texts differs significantly from the text length used in the pre-training of PLMs. As a result, there is inconsistency in the distribution of text length between pre-training and event representation learning, which may undermine the learning process of event representation based on PLMs. In this study, we present PromptCL, a novel framework for event representation learning that effectively elicits the capabilities of PLMs to comprehensively capture the semantics of short event texts. PromptCL utilizes a Prompt template borrowed from prompt learning to expand the input text during Contrastive Learning. This helps in enhancing the event representation learning by providing a structured outline of the event components. Moreover, we propose Subject-Predicate-Object (SPO) word order and Event-oriented Masked Language Modeling (EventMLM) to train PLMs to understand the relationships between event components. Our experimental results demonstrate that PromptCL outperforms state-of-the-art baselines on event related tasks. Additionally, we conduct a thorough analysis and demonstrate that using a prompt results in improved generalization capabilities for event representations. Our code will be available at this https URL.
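The prompt expansion is essentially a string template applied before contrastive learning; a hypothetical example (the paper's actual template wording may differ):

```python
def expand_event(subject, predicate, obj):
    # Wrap a short subject-predicate-object event in a structured outline so its
    # length and shape are closer to the PLM's pre-training text distribution.
    return (f"This is an event. The subject is {subject}, "
            f"the predicate is {predicate}, and the object is {obj}.")
```

The expanded text both lengthens the input and spells out the event's component structure, which is what the SPO ordering and EventMLM objectives then exploit.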
https://arxiv.org/abs/2404.17877
The prevalent solution for BioNER involves using representation learning techniques coupled with sequence labeling. However, such methods are inherently task-specific, demonstrate poor generalizability, and often require a dedicated model for each dataset. To leverage the versatile capabilities of recent remarkable large language models (LLMs), several endeavors have explored generative approaches to entity extraction. Yet, these approaches often fall short of the effectiveness of previous sequence labeling approaches. In this paper, we utilize the open-sourced LLM LLaMA2 as the backbone model and design specific instructions to distinguish between different types of entities and datasets. By combining the LLM's understanding of instructions with sequence labeling techniques, we use a mix of datasets to train a model capable of extracting various types of entities. Given that the backbone LLM lacks specialized medical knowledge, we also integrate external entity knowledge bases and employ instruction tuning to compel the model to densely recognize carefully curated entities. Our model VANER, trained with a small partition of parameters, significantly outperforms previous LLM-based models and, for the first time for a model based on an LLM, surpasses the majority of conventional state-of-the-art BioNER systems, achieving the highest F1 scores across three datasets.
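An instruction of the kind described might look like the template below. The wording, and the use of BC5CDR as an example corpus, are entirely hypothetical; the paper's templates are not reproduced here.

```python
def bio_ner_instruction(dataset, entity_type, sentence):
    # Naming the dataset and entity type in the instruction lets a single set
    # of weights serve many corpora and entity types.
    return (f"Task: biomedical NER on the {dataset} dataset. "
            f"Extract all entities of type '{entity_type}'.\n"
            f"Sentence: {sentence}")
```

At training time, one such instruction-formatted example would be built per (dataset, entity type) pair over the mixed corpora.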
https://arxiv.org/abs/2404.17835
Federated learning ensures the privacy of clients by conducting distributed training on individual client devices and sharing only the model weights with a central server. However, in real-world scenarios, the heterogeneity of data among clients necessitates appropriate personalization methods. In this paper, we aim to address this heterogeneity using a form of parameter decoupling known as representation learning. Representation learning divides deep learning models into 'base' and 'head' components. The base component, capturing common features across all clients, is shared with the server, while the head component, capturing unique features specific to individual clients, remains local. We propose a new representation learning-based approach that suggests decoupling the entire deep learning model into more densely divided parts with the application of suitable scheduling methods, which can benefit not only data heterogeneity but also class heterogeneity. In this paper, we compare and analyze two layer scheduling approaches, namely forward (\textit{Vanilla}) and backward (\textit{Anti}), in the context of data and class heterogeneity among clients. Our experimental results show that the proposed algorithm, when compared to existing personalized federated learning algorithms, achieves increased accuracy, especially under challenging conditions, while reducing computation costs.
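The base/head decoupling amounts to partitioning the parameter dict before each communication round; the layer-scheduling question is simply which keys land in `base_keys` over time ('Vanilla' admits front layers first, 'Anti' back layers first). A minimal sketch, with hypothetical key names:

```python
def split_model(params, base_keys):
    # 'base' is shared with (and aggregated by) the server; 'head' stays local,
    # preserving each client's personalized features.
    base = {k: v for k, v in params.items() if k in base_keys}
    head = {k: v for k, v in params.items() if k not in base_keys}
    return base, head
```

A finer partition with a schedule over `base_keys` is what distinguishes the proposed approach from the usual fixed two-way base/head split.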
https://arxiv.org/abs/2404.17799
Diffusion probabilistic models (DPMs) have become the state-of-the-art in high-quality image generation. However, DPMs have an arbitrary noisy latent space with no interpretable or controllable semantics. Although there has been significant research effort to improve image sample quality, there is little work on representation-controlled generation using diffusion models. Specifically, causal modeling and controllable counterfactual generation using DPMs is an underexplored area. In this work, we propose CausalDiffAE, a diffusion-based causal representation learning framework that enables counterfactual generation according to a specified causal model. Our key idea is to use an encoder to extract high-level, semantically meaningful causal variables from high-dimensional data and to model stochastic variation using reverse diffusion. We propose a causal encoding mechanism that maps high-dimensional data to causally related latent factors and parameterizes the causal mechanisms among latent factors using neural networks. To enforce the disentanglement of causal variables, we formulate a variational objective and leverage auxiliary label information in a prior to regularize the latent space. We propose a DDIM-based counterfactual generation procedure subject to do-interventions. Finally, to address the limited-label-supervision scenario, we also study the application of CausalDiffAE when part of the training data is unlabeled, which also enables granular control over the strength of interventions when generating counterfactuals during inference. We empirically show that CausalDiffAE learns a disentangled latent space and is capable of generating high-quality counterfactual images.
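A do-intervention on the learned latents can be sketched with a tiny deterministic structural causal model (hypothetical simplification; CausalDiffAE parameterizes the mechanisms with neural networks and decodes the intervened latents with DDIM):

```python
def do_intervention(z, mechanisms, topo_order, var, value):
    # Clamp `var` to `value`, then recompute downstream variables from their
    # parents in topological order; the resulting latents would be decoded
    # into a counterfactual image.
    z = dict(z)
    z[var] = value
    for name in topo_order:
        if name != var and name in mechanisms:
            z[name] = mechanisms[name](z)
    return z
```

Variables upstream of the intervened one keep their observed values; only descendants are recomputed, which is exactly what distinguishes do-interventions from conditioning.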
https://arxiv.org/abs/2404.17735
Optical Doppler Tomography (ODT) is a blood flow imaging technique popularly used in bioengineering applications. The fundamental unit of ODT is the 1D frequency response along the A-line (depth), named raw A-scan. A 2D ODT image (B-scan) is obtained by first sensing raw A-scans along the B-line (width), and then constructing the B-scan from these raw A-scans via magnitude-phase analysis and post-processing. To obtain a high-resolution B-scan with a precise flow map, densely sampled A-scans are required in current methods, causing both computational and storage burdens. To address this issue, in this paper we propose a novel sparse reconstruction framework with four main sequential steps: 1) early magnitude-phase fusion that encourages rich interaction of the complementary information in magnitude and phase, 2) State Space Model (SSM)-based representation learning, inspired by recent successes in Mamba and VMamba, to naturally capture both the intra-A-scan sequential information and between-A-scan interactions, 3) an Inception-based Feedforward Network module (IncFFN) to further boost the SSM-module, and 4) a B-line Pixel Shuffle (BPS) layer to effectively reconstruct the final results. In the experiments on real-world animal data, our method shows clear effectiveness in reconstruction accuracy. As the first application of SSM for image reconstruction tasks, we expect our work to inspire related explorations in not only efficient ODT imaging techniques but also generic image enhancement.
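The B-line Pixel Shuffle step can be sketched as a 1D pixel shuffle along the width axis: fold an upsampling factor r out of the channel dimension into r-times more A-scan positions. This is a minimal analogue only; the paper's BPS layer sits at the end of a learned reconstruction network.

```python
import numpy as np

def b_line_pixel_shuffle(x, r):
    # x: (C*r, H, W) feature map; returns (C, H, W*r), interleaving the r
    # channel groups into r-times more positions along the B-line (width).
    c, h, w = x.shape
    assert c % r == 0
    return x.reshape(c // r, r, h, w).transpose(0, 2, 3, 1).reshape(c // r, h, w * r)
```

This is the same rearrangement used by sub-pixel convolution in super-resolution, restricted to one spatial axis.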
https://arxiv.org/abs/2404.17484
Self-Supervised Learning (SSL) is a valuable and robust training methodology for contemporary Deep Neural Networks (DNNs), enabling unsupervised pretraining on a `pretext task' that does not require ground-truth labels/annotation. This allows efficient representation learning from massive amounts of unlabeled training data, which in turn leads to increased accuracy in a `downstream task' by exploiting supervised transfer learning. Despite the relatively straightforward conceptualization and applicability of SSL, it is not always feasible to collect and/or to utilize very large pretraining datasets, especially when it comes to real-world application settings. In particular, in cases of specialized and domain-specific application scenarios, it may not be achievable or practical to assemble a relevant image pretraining dataset in the order of millions of instances or it could be computationally infeasible to pretrain at this scale. This motivates an investigation on the effectiveness of common SSL pretext tasks, when the pretraining dataset is of relatively limited/constrained size. In this context, this work introduces a taxonomy of modern visual SSL methods, accompanied by detailed explanations and insights regarding the main categories of approaches, and, subsequently, conducts a thorough comparative experimental evaluation in the low-data regime, targeting to identify: a) what is learnt via low-data SSL pretraining, and b) how do different SSL categories behave in such training scenarios. Interestingly, for domain-specific downstream tasks, in-domain low-data SSL pretraining outperforms the common approach of large-scale pretraining on general datasets. Grounded on the obtained results, valuable insights are highlighted regarding the performance of each category of SSL methods, which in turn suggest straightforward future research directions in the field.
https://arxiv.org/abs/2404.17202
Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.
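One iteration of fitting class prototypes can be sketched with hard assignments (a k-means-like simplification; PriMaPs-EM uses a stochastic EM over mask proposals and their feature representations):

```python
import numpy as np

def em_step(features, prototypes):
    # features: (n, d) L2-normalized mask features; prototypes: (k, d).
    sim = features @ prototypes.T            # cosine-style similarity
    assign = sim.argmax(axis=1)              # E-step: assign each mask to a prototype
    new_protos = np.vstack([
        features[assign == k].mean(axis=0) if np.any(assign == k) else prototypes[k]
        for k in range(prototypes.shape[0])
    ])                                       # M-step: re-estimate prototypes
    norms = np.linalg.norm(new_protos, axis=1, keepdims=True)
    return new_protos / np.clip(norms, 1e-12, None), assign
```

Iterating the two steps until the assignments stop changing yields the class prototypes used to label regions.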
https://arxiv.org/abs/2404.16818
Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using SPARO, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual SPARO concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of SPARO's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.
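The read-out can be sketched as k independent single-head attention queries over the encoder's tokens, one per slot, so each slot attends with its own pattern. Shapes and names here are illustrative, not SPARO's exact parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sparo_readout(tokens, queries, Wv):
    # tokens: (n, d) encoder outputs; queries: (k, d), one learned query per
    # slot; Wv: (d, d_slot) value projection shared across slots.
    attn = softmax(queries @ tokens.T)   # (k, n): a separate attention pattern per slot
    return attn @ (tokens @ Wv)          # (k, d_slot): one embedding per slot/'concept'
```

Because each slot is produced by its own attention head, individual concepts can be inspected or intervened on independently, which is what the selection experiments in the abstract exploit.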
https://arxiv.org/abs/2404.15721
Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification, relying heavily on partial prior knowledge during decision-making, resulting in suboptimal performance. Although multi-model fusion (MMF) can mitigate some of these issues, redundancy in the learned representations may limit improvements. To this end, we propose an adversarial complementary representation learning (ACoRL) framework that enables newly trained models to avoid previously acquired knowledge, allowing each individual component model to learn maximally distinct, complementary representations. We provide three detailed explanations of why this works, and experimental results demonstrate that our method improves performance more efficiently than traditional MMF. Furthermore, attribution analysis validates that the model trained under ACoRL acquires more complementary knowledge, highlighting the efficacy of our approach in enhancing efficiency and robustness across tasks.
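The complementarity pressure can be illustrated with a penalty on the squared cosine similarity to a frozen, previously trained model's representation (a hypothetical stand-in for the adversarial objective; it captures only the "avoid previously acquired knowledge" direction):

```python
import numpy as np

def complementarity_penalty(z_new, z_old):
    # Penalize alignment between the new model's representation and a frozen
    # earlier model's; minimizing this pushes the new model toward directions
    # the earlier model does not cover.
    cos = float(z_new @ z_old / (np.linalg.norm(z_new) * np.linalg.norm(z_old)))
    return cos ** 2
```

Adding this term to the new model's task loss trades a little task fit for representations that are maximally distinct across component models.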
https://arxiv.org/abs/2404.15704
The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Masked AutoEncoder (MAE) feature pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to the large spatial sizes and high dimensionality of 3D medical images, the lack of hierarchical design in MAE may hinder performance on downstream tasks. In this paper, we propose a novel \textit{Mask in Mask (MiM)} pre-training framework for 3D medical images, which aims to advance MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are then reconstructed simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent-level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enhance hierarchical representation learning efficiently during pre-training. MiM was pre-trained on a large number of available 3D volumetric images, \textit{i.e.,} Computed Tomography (CT) images covering various body parts. Extensive experiments on thirteen public datasets demonstrate the superiority of MiM over other SSL methods in organ/lesion/tumor segmentation and disease classification. We further scale MiM up to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance the performance of downstream tasks. These improvements also suggest that the research community should pay more attention to the scale of the pre-training dataset when building healthcare foundation models for 3D medical images.
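The multi-granularity masking can be illustrated with a small sketch that masks a 3D volume at a coarse patch level and then again at a finer patch level nested inside it. This is a minimal toy, not the paper's pipeline; the function name `hierarchical_masks`, the patch sizes, and the nesting rule are assumptions for illustration.

```python
import numpy as np

def hierarchical_masks(vol_shape=(8, 8, 8), coarse=4, fine=2,
                       mask_ratio=0.5, seed=0):
    """Toy MiM-style masking: decide visibility per coarse patch, then
    per fine patch, with fine-level visibility nested inside the coarse
    level, yielding masked inputs at two granularities."""
    rng = np.random.default_rng(seed)

    def level_mask(patch):
        grid = tuple(s // patch for s in vol_shape)
        keep = rng.random(grid) >= mask_ratio       # True = visible patch
        # upsample patch-level decisions to voxel resolution
        return keep.repeat(patch, 0).repeat(patch, 1).repeat(patch, 2)

    coarse_mask = level_mask(coarse)
    fine_mask = level_mask(fine) & coarse_mask      # fine nested in coarse
    return coarse_mask, fine_mask

c, f = hierarchical_masks()
print(c.shape, f.shape)   # (8, 8, 8) (8, 8, 8)
print(f[~c].any())        # False: fine visibility never exceeds coarse
```

Reconstruction targets would then be defined at both levels, with a cross-level alignment term tying the adjacent-level volumes together.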
https://arxiv.org/abs/2404.15580
Molecule-and-text cross-modal representation learning has emerged as a promising direction for enhancing the quality of molecular representation, thereby improving performance in various scientific fields, including drug discovery and materials science. Existing studies adopt a global alignment approach to learn the knowledge from different modalities. These global alignment approaches fail to capture fine-grained information, such as molecular fragments and their corresponding textual descriptions, which is crucial for downstream tasks. Furthermore, it is infeasible to model such information with a similar global alignment strategy due to the scarcity of paired, locally annotated data in existing datasets. In this paper, we propose Atomas, a multi-modal molecular representation learning framework that jointly learns representations from SMILES strings and text. We design a Hierarchical Adaptive Alignment model to concurrently learn the fine-grained fragment correspondence between the two modalities and align these fragment representations at three levels. Additionally, Atomas's end-to-end training framework incorporates the tasks of understanding and generating molecules, thereby supporting a wider range of downstream tasks. In the retrieval task, Atomas exhibits robust generalization ability and outperforms the baseline by 30.8% in recall@1 on average. In the generation task, Atomas achieves state-of-the-art results in both the molecule captioning task and the molecule generation task. Moreover, visualization of the Hierarchical Adaptive Alignment model further confirms the chemical significance of our approach. Our code can be found at https://anonymous.4open.science/r/Atomas-03C3.
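The contrast between global and fragment-level alignment can be sketched with a toy two-level score: a fine score that matches each molecular fragment embedding to its best textual fragment, and a global score over mean-pooled embeddings. This is an illustrative simplification, not Atomas's actual three-level model; `hierarchical_alignment_score` and the equal weighting of the two levels are assumptions.

```python
import numpy as np

def cosine_matrix(A, B):
    # pairwise cosine similarities between rows of A and rows of B
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def hierarchical_alignment_score(mol_frags, txt_frags):
    """Toy two-level alignment in the spirit of hierarchical adaptive
    alignment: fragment-level matching plus a global (pooled) score."""
    sim = cosine_matrix(mol_frags, txt_frags)
    fine = sim.max(axis=1).mean()          # best textual match per fragment
    g_mol = mol_frags.mean(axis=0)         # global molecule embedding
    g_txt = txt_frags.mean(axis=0)         # global text embedding
    glob = float(g_mol @ g_txt /
                 (np.linalg.norm(g_mol) * np.linalg.norm(g_txt)))
    return 0.5 * (fine + glob)

# perfectly matched fragment embeddings score 1.0 at both levels
score = hierarchical_alignment_score(np.eye(3), np.eye(3))
print(score)
```

A fragment-level term like `fine` is what lets local correspondences (fragment ↔ description) contribute to the training signal that a purely global score would miss.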
https://arxiv.org/abs/2404.16880
Pre-training GNNs to extract transferable knowledge and applying it to downstream tasks has become the de facto standard of graph representation learning. Recent works have focused on designing self-supervised pre-training tasks to extract useful and universal transferable knowledge from large-scale unlabeled data. However, they face an inevitable question: traditional pre-training strategies, which aim at extracting useful information about the pre-training tasks, may not extract all the information useful for the downstream task. In this paper, we reexamine the pre-training process within traditional pre-training and fine-tuning frameworks from the perspective of the Information Bottleneck (IB) and confirm that the forgetting phenomenon in the pre-training phase can have detrimental effects on downstream tasks. Therefore, we propose a novel \underline{D}elayed \underline{B}ottlenecking \underline{P}re-training (DBP) framework, which maintains as much mutual information as possible between latent representations and training data during the pre-training phase by suppressing the compression operation, and delays the compression operation to the fine-tuning phase so that the compression can be guided by labeled fine-tuning data and downstream tasks. To achieve this, we design two information control objectives that can be directly optimized and further integrate them into the actual model design. Extensive experiments on both the chemistry and biology domains demonstrate the effectiveness of DBP.
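The two-phase idea can be sketched as an objective whose compression term is switched off during pre-training and switched on, label-guided, during fine-tuning. This is only a schematic sketch under IB-style notation; the function `dbp_objective`, the scalar stand-ins `info_zx` / `info_zy` for the mutual-information estimates I(Z;X) / I(Z;Y), and the single weight `beta` are assumptions, not the paper's two actual objectives.

```python
def dbp_objective(phase, task_loss, info_zx, info_zy, beta=0.5):
    """Toy delayed-bottlenecking schedule: keep I(Z;X) high while
    pre-training (no compression), then compress during fine-tuning
    while preserving label-relevant information I(Z;Y)."""
    if phase == "pretrain":
        # suppress compression: reward the latent for retaining
        # information about the training data X
        return task_loss - beta * info_zx
    # delayed bottleneck: compress I(Z;X) but protect I(Z;Y),
    # so compression is guided by labeled downstream data
    return task_loss + beta * info_zx - beta * info_zy

pre = dbp_objective("pretrain", 1.0, info_zx=0.5, info_zy=0.0)
fin = dbp_objective("finetune", 1.0, info_zx=0.5, info_zy=0.2)
print(pre, fin)
```

In practice the mutual-information terms would be estimated (e.g. via variational bounds) rather than passed in as scalars; the sketch only shows how delaying the bottleneck changes the sign of the I(Z;X) term between phases.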
https://arxiv.org/abs/2404.14941