Sequential recommendation is dedicated to offering items of interest to users based on their historical behaviors. The attribute-opinion pairs, expressed by users in their reviews of items, provide the potential to capture user preferences and item characteristics at a fine-grained level. To this end, we propose a novel framework, FineRec, that explores the attribute-opinion pairs of reviews to finely handle sequential recommendation. Specifically, we utilize a large language model to extract attribute-opinion pairs from reviews. For each attribute, a unique attribute-specific user-opinion-item graph is created, where the corresponding opinions serve as the edges linking heterogeneous user and item nodes. To tackle the diversity of opinions, we devise a diversity-aware convolution operation to aggregate information within the graphs, enabling attribute-specific user and item representation learning. Ultimately, we present an interaction-driven fusion mechanism to integrate attribute-specific user/item representations across all attributes for generating recommendations. Extensive experiments conducted on several real-world datasets demonstrate the superiority of our FineRec over existing state-of-the-art methods. Further analysis also verifies the effectiveness of our fine-grained manner of handling the task.
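The graph construction step lends itself to a small sketch. Assuming the LLM extraction yields (user, item, attribute, opinion) tuples, building one user-opinion-item graph per attribute amounts to grouping opinion-labeled edges by attribute (the record format and function below are illustrative, not FineRec's actual implementation):

```python
from collections import defaultdict

def build_attribute_graphs(records):
    """Group (user, item, attribute, opinion) tuples into one
    attribute-specific graph each; the opinion labels the edge linking
    the heterogeneous user and item nodes."""
    graphs = defaultdict(list)  # attribute -> list of (user, opinion, item) edges
    for user, item, attribute, opinion in records:
        graphs[attribute].append((user, opinion, item))
    return dict(graphs)

# Toy extraction output covering two attributes
records = [
    ("u1", "i1", "battery", "long-lasting"),
    ("u2", "i1", "battery", "drains fast"),
    ("u1", "i2", "screen", "bright"),
]
graphs = build_attribute_graphs(records)
```

Each attribute-specific graph can then be fed to its own convolution, as the abstract describes.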
https://arxiv.org/abs/2404.12975
We present a novel method to generate human motion to populate 3D indoor scenes. It can be controlled with various combinations of conditioning signals, such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds. State-of-the-art methods are either models specialized to a single setting, require vast amounts of high-quality and diverse training data, or are unconditional models that do not integrate scene or other contextual information. As a consequence, they have limited applicability and rely on costly training data. To address these limitations, we propose a new method, dubbed Purposer, based on neural discrete representation learning. Our model is capable of exploiting, in a flexible manner, different types of information already present in open-access large-scale datasets such as AMASS. First, we encode unconditional human motion into a discrete latent space. Second, an autoregressive generative model, conditioned on key contextual information, either through prompting or additive tokens, and trained for next-step prediction in this space, synthesizes sequences of latent indices. We further design a novel conditioning block to handle future conditioning information in such a causal model by using a network with two branches to compute separate stacks of features. In this manner, Purposer can generate realistic motion sequences in diverse test scenes. Through exhaustive evaluation, we demonstrate that our multi-contextual solution outperforms existing approaches specialized for specific contextual information, both in terms of quality and diversity. Our model is trained with short sequences, but a byproduct of being able to use various conditioning signals is that at test time different combinations can be used to chain short sequences together and generate long motions within a context scene.
https://arxiv.org/abs/2404.12942
Current point cloud semantic segmentation has achieved great advances when given sufficient labels. However, the dense annotation of LiDAR point clouds remains prohibitively expensive and time-consuming, unable to keep up with the continuously growing volume of data. In this paper, we propose annotating images with scattered points, followed by utilizing SAM (a Foundation model) to generate semantic segmentation labels for the images. Finally, by mapping the segmentation labels of the images to the LiDAR space using the intrinsic and extrinsic parameters of the camera and LiDAR, we obtain labels for point cloud semantic segmentation, and release Scatter-KITTI and Scatter-nuScenes, which are the first works to utilize image segmentation-based SAM for weakly supervised point cloud semantic segmentation. Furthermore, to mitigate the influence of erroneous pseudo labels obtained from sparse annotations on point cloud features, we propose a multi-modal weakly supervised network for LiDAR semantic segmentation, called MM-ScatterNet. This network combines features from both point cloud and image modalities, enhancing the representation learning of point clouds by introducing consistency constraints between multi-modal features and point cloud features. On the SemanticKITTI dataset, we achieve 66% of fully supervised performance using only 0.02% of annotated data, and on the NuScenes dataset, we achieve 95% of fully supervised performance using only 0.1% labeled points.
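The label-mapping step, projecting LiDAR points through the camera extrinsics and intrinsics to read off image segmentation labels, can be sketched as follows (a minimal pinhole-camera version with assumed array shapes; the released pipeline will differ in detail):

```python
import numpy as np

def transfer_labels(points, seg_labels, K, T_cam_lidar, img_hw):
    """Project LiDAR points into the camera image and read off
    segmentation labels. points: (n, 3) in LiDAR frame; seg_labels:
    (h, w) image labels; K: (3, 3) intrinsics; T_cam_lidar: (4, 4)
    extrinsics. Returns (n,) point labels, -1 where unprojectable."""
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])         # homogeneous (n, 4)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]           # into the camera frame
    in_front = pts_cam[:, 2] > 0                         # keep points ahead of camera
    uvw = (K @ pts_cam.T).T
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)  # pixel coordinates
    h, w = img_hw
    valid = in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                     & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    labels = np.full(n, -1, dtype=int)                   # -1 = unlabeled
    labels[valid] = seg_labels[uv[valid, 1], uv[valid, 0]]
    return labels
```

Points behind the camera or outside the image bounds simply stay unlabeled.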
https://arxiv.org/abs/2404.12861
In this paper, we propose a new Multimodal Representation Learning (MRL) method for Multimodal Sentiment Analysis (MSA), named Co-SA, which facilitates the adaptive interaction between modalities through Cooperative Sentiment Agents. Co-SA comprises two critical components: the Sentiment Agents Establishment (SAE) phase and the Sentiment Agents Cooperation (SAC) phase. During the SAE phase, each sentiment agent deals with a unimodal signal and highlights explicit dynamic sentiment variations within the modality via the Modality-Sentiment Disentanglement (MSD) and Deep Phase Space Reconstruction (DPSR) modules. Subsequently, in the SAC phase, Co-SA meticulously designs task-specific interaction mechanisms for the sentiment agents to coordinate multimodal signals for learning the joint representation. Specifically, Co-SA equips each sentiment agent with an independent policy model that captures significant properties within the modality. These policies are optimized mutually through a unified reward adapted to downstream tasks. Benefiting from the rewarding mechanism, Co-SA transcends the limitation of pre-defined fusion modes and adaptively captures unimodal properties for MRL in the multimodal interaction setting. To demonstrate the effectiveness of Co-SA, we apply it to Multimodal Sentiment Analysis (MSA) and Multimodal Emotion Recognition (MER) tasks. Our comprehensive experimental results demonstrate that Co-SA excels at discovering diverse cross-modal features, encompassing both common and complementary aspects. The code is available at this https URL.
https://arxiv.org/abs/2404.12642
Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. However, despite this promising potential, conventional SSL methods encounter limitations, including challenges in achieving semantic alignment and capturing subtle details. This leads to suboptimal representations, which fail to accurately capture the underlying anatomical structures and pathological details. In response to these constraints, we introduce a novel SSL framework, OPTiML, employing optimal transport (OT), to capture the dense semantic invariance and fine-grained details, thereby enhancing the overall effectiveness of SSL in medical image representation learning. The core idea is to integrate OT with a cross-viewpoint semantics infusion module (CV-SIM), which effectively captures complex, fine-grained details inherent in medical images across different viewpoints. In addition to the CV-SIM module, OPTiML imposes variance and covariance regularizations within the OT framework to force the model to focus on clinically relevant information while discarding less informative features. Through these, the proposed framework demonstrates its capacity to learn semantically rich representations that can be applied to various medical imaging tasks. To validate its effectiveness, we conduct experimental studies on three publicly available datasets from the chest X-ray modality. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.
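As a point of reference, the entropic-regularized OT that frameworks of this kind typically build on can be computed with Sinkhorn iterations. The toy solver below assumes uniform marginals and a dense cost matrix (assumptions of this sketch, not necessarily OPTiML's exact solver) and returns a transport plan between two sets of features:

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations.
    cost: (n, m) pairwise cost matrix. Returns the (n, m) transport
    plan with uniform marginals a = 1/n, b = 1/m."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-cost / reg)        # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)            # rescale to match row marginals
        v = b / (K.T @ u)          # rescale to match column marginals
    return u[:, None] * K * v[None, :]
```

With a small regularizer the plan concentrates mass on the cheapest pairings, which is what makes it usable as a dense feature-alignment signal.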
https://arxiv.org/abs/2404.11868
In this paper, we present a novel approach termed Prompt-Driven Feature Diffusion (PDFD) within a semi-supervised learning framework for Open World Semi-Supervised Learning (OW-SSL). At its core, PDFD deploys an efficient feature-level diffusion model with the guidance of class-specific prompts to support discriminative feature representation learning and feature generation, tackling the challenge of the non-availability of labeled data for unseen classes in OW-SSL. In particular, PDFD utilizes class prototypes as prompts in the diffusion model, leveraging their class-discriminative and semantic generalization ability to condition and guide the diffusion process across all the seen and unseen classes. Furthermore, PDFD incorporates a class-conditional adversarial loss for diffusion model training, ensuring that the features generated via the diffusion process can be discriminatively aligned with the class-conditional features of the real data. Additionally, the class prototypes of the unseen classes are computed using only unlabeled instances with confident predictions within a semi-supervised learning framework. We conduct extensive experiments to evaluate the proposed PDFD. The empirical results show PDFD exhibits remarkable performance enhancements over many state-of-the-art existing methods.
https://arxiv.org/abs/2404.11795
This paper focuses on reducing the communication cost of federated learning by exploring generalization bounds and representation learning. We first characterize a tighter generalization bound for one-round federated learning based on the local clients' generalizations and the heterogeneity of the data distribution (non-iid scenario). We also characterize a generalization bound for R-round federated learning and its relation to the number of local updates (local stochastic gradient descent (SGD) steps). Then, based on our generalization bound analysis and our representation learning interpretation of this analysis, we show for the first time that less frequent aggregations, and hence more local updates, for the representation extractor (which usually corresponds to the initial layers) lead to more generalizable models, particularly for non-iid scenarios. We design a novel Federated Learning with Adaptive Local Steps (FedALS) algorithm based on our generalization bound and representation learning analysis. FedALS employs varying aggregation frequencies for different parts of the model, thereby reducing the communication cost. We conclude with experimental results showing the effectiveness of FedALS.
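The varying aggregation frequencies can be sketched in a few lines. The toy federated-averaging step below uses invented parameter names (`tau_body`, `tau_head`) and a two-part model split; FedALS's actual schedule and model partition follow the paper:

```python
import numpy as np

def fedals_round(client_models, round_idx, tau_body=4, tau_head=1):
    """One communication round of a FedALS-style schedule: the
    representation extractor ('body') is averaged less often (every
    tau_body rounds) than the task head (every tau_head rounds).
    client_models: list of dicts {'body': array, 'head': array}."""
    if round_idx % tau_head == 0:
        head_avg = np.mean([m['head'] for m in client_models], axis=0)
        for m in client_models:
            m['head'] = head_avg          # frequent head aggregation
    if round_idx % tau_body == 0:
        body_avg = np.mean([m['body'] for m in client_models], axis=0)
        for m in client_models:
            m['body'] = body_avg          # infrequent body aggregation
    return client_models
```

Skipping body aggregations is exactly where the communication savings come from: only the head travels in most rounds.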
https://arxiv.org/abs/2404.11754
Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information about geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework that considers spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flipping, rotation, and scene flow. For spatial augmentations, we find that, depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields the best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We apply our pre-training method to 3D object detection, where it outperforms existing equivariant and invariant approaches in many settings.
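The core of an equivariance objective is to penalize the gap between encoding a transformed input and transforming the encoded input. A minimal sketch for a point-wise encoder and a linear transform (illustrative only; the paper also uses contrastive and classification-based variants):

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def equivariance_loss(encoder, points, T):
    """MSE between f(T(x)) and T(f(x)): zero iff the encoder f is
    equivariant to the linear transform T on this input."""
    feat_of_transformed = encoder(points @ T.T)   # f(T(x))
    transformed_feat = encoder(points) @ T.T      # T(f(x))
    diff = feat_of_transformed - transformed_feat
    return float(np.mean(diff ** 2))
```

An invariance objective would instead compare `encoder(points @ T.T)` to `encoder(points)` directly, discarding the geometric relationship the equivariant loss preserves.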
https://arxiv.org/abs/2404.11737
Change detection aims to identify remote sensing object changes by analyzing data between bitemporal image pairs. Due to the large temporal and spatial span of data collection in change detection image pairs, there is often a significant amount of task-specific and task-agnostic noise. Previous efforts have focused excessively on denoising, at the cost of a great deal of fine-grained information. In this paper, we revisit the importance of fine-grained features in change detection and propose a series of operations for fine-grained information compensation and noise decoupling (FINO). First, the context is utilized to compensate for the fine-grained information in the feature space. Next, a shape-aware and a brightness-aware module are designed to improve the capacity for representation learning. The shape-aware module guides the backbone network toward more precise shape estimation and the extraction of object shape features. The brightness-aware module learns an overall brightness estimation to improve the model's robustness to task-agnostic noise. Finally, a task-specific noise decoupling structure is designed to improve the model's ability to separate noise interference from feature similarity. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results on multiple change detection benchmarks. The code will be made available.
https://arxiv.org/abs/2404.11318
Time series anomaly detection (TAD) faces a significant challenge due to the scarcity of labelled data, which hinders the development of accurate detection models. Unsupervised domain adaptation (UDA) addresses this challenge by leveraging a labelled dataset from a related domain to detect anomalies in a target dataset. Existing domain adaptation techniques assume that the number of anomalous classes does not change between the source and target domains. In this paper, we propose a novel Domain Adaptation Contrastive learning for Anomaly Detection in multivariate time series (DACAD) model to address this issue by combining UDA and contrastive representation learning. DACAD's approach includes an anomaly injection mechanism that introduces various types of synthetic anomalies, enhancing the model's ability to generalise across unseen anomalous classes in different domains. This method significantly broadens the model's adaptability and robustness. Additionally, we propose a supervised contrastive loss for the source domain and a self-supervised contrastive triplet loss for the target domain, improving comprehensive feature representation learning and extraction of domain-invariant features. Finally, an effective Centre-based Entropy Classifier (CEC) is proposed specifically for anomaly detection, facilitating accurate learning of normal boundaries in the source domain. Our extensive evaluation across multiple real-world datasets against leading models in time series anomaly detection and UDA underscores DACAD's effectiveness. The results validate DACAD's superiority in transferring knowledge across domains and its potential to mitigate the challenge of limited labelled data in time series anomaly detection.
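The anomaly-injection idea can be illustrated with a toy spike injector. Spikes are only one of many possible synthetic anomaly types, and the parameter names and spike form here are assumptions of this sketch, not DACAD's full mechanism:

```python
import numpy as np

def inject_anomalies(series, rng, n_spikes=3, scale=5.0):
    """Add synthetic spike anomalies to a normal multivariate series.
    series: (t_len, n_dims). Returns the corrupted series and a
    point-wise anomaly label mask (1 = injected anomaly)."""
    x = series.copy()
    t_len, n_dims = x.shape
    labels = np.zeros(t_len, dtype=int)
    idx = rng.choice(t_len, size=n_spikes, replace=False)  # distinct timestamps
    dims = rng.integers(0, n_dims, size=n_spikes)          # one channel per spike
    x[idx, dims] += scale * x.std()                        # push far outside normal range
    labels[idx] = 1
    return x, labels
```

The injected labels give the contrastive losses known positives/negatives even when the source domain's real anomaly classes don't cover the target's.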
https://arxiv.org/abs/2404.11269
ICD (International Classification of Diseases) coding involves assigning ICD codes to patient visits based on their medical notes. ICD coding is a challenging multilabel text classification problem due to noisy medical document inputs. Recent advancements in automated ICD coding have enhanced performance by integrating additional data and knowledge bases with the encoding of medical notes and codes. However, most of them ignore the code hierarchy, leading to improper code assignments. To address these problems, we propose a novel framework based on associated and hierarchical code description distillation (AHDD) for better code representation learning and avoidance of improper code assignment. Specifically, we leverage the code description and the hierarchical structure inherent to the ICD codes. The code description is also applied to inform the attention layer and output layer. Experimental results on the benchmark dataset show the superiority of the proposed framework over several state-of-the-art baselines.
https://arxiv.org/abs/2404.11132
In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose, for the first time, a novel dynamic self-adaptive multiscale distillation from a pre-trained multimodal large model for efficient cross-modal representation learning. Unlike existing distillation methods, our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model and ensuring that the student model inherits a comprehensive and nuanced understanding of the teacher's knowledge. To optimize each distillation loss in a balanced and efficient manner, we propose a dynamic self-adaptive distillation loss balancer, a novel component that eliminates the need for manual loss weight adjustments and dynamically balances each loss item during the distillation process. Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources. This efficient approach is suited for various applications and allows the deployment of advanced multimodal technologies even in resource-limited settings. Extensive experiments have demonstrated that our method maintains high performance while significantly reducing model complexity and training costs. Moreover, our distilled student model utilizes only image-level information to achieve state-of-the-art performance on cross-modal retrieval tasks, surpassing previous methods that relied on region-level information.
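One plausible rendering of a dynamic loss balancer is inverse-magnitude weighting, where each term's weight adapts at every step so that no loss dominates by scale alone. This is a sketch of the general idea only; the paper's balancer may use a different rule:

```python
import numpy as np

def balance_losses(loss_values, eps=1e-8):
    """Weight each loss term inversely to its current magnitude, so all
    terms contribute equally to the total, with no manual tuning.
    Returns (balanced total, normalized weights)."""
    losses = np.asarray(loss_values, dtype=float)
    weights = 1.0 / (losses + eps)        # large losses get small weights
    weights = weights / weights.sum()     # normalize to sum to 1
    return float((weights * losses).sum()), weights
```

Recomputing the weights every training step is what makes the scheme "dynamic": as one distillation loss shrinks, its weight grows to keep it in play.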
https://arxiv.org/abs/2404.10838
Recently, heterogeneous graph neural networks (HGNNs) have achieved impressive success in representation learning by capturing long-range dependencies and heterogeneity at the node level. However, few existing studies have delved into the utilization of node attributes in heterogeneous information networks (HINs). In this paper, we investigate the impact of inter-node attribute disparities on HGNN performance on the benchmark task, i.e., node classification, and empirically find that typical models exhibit a significant performance decline when classifying nodes whose attributes markedly differ from those of their neighbors. To alleviate this issue, we propose a novel Attribute-Guided heterogeneous Information Networks representation learning model with Transformer (AGHINT), which allows a more effective aggregation of neighbor node information under the guidance of attributes. Specifically, AGHINT transcends the constraints of the original graph structure by directly integrating higher-order similar neighbor features into the learning process and modifies the message-passing mechanism between nodes based on their attribute disparities. Extensive experimental results on three real-world heterogeneous graph benchmarks with target node attributes demonstrate that AGHINT outperforms the state of the art.
https://arxiv.org/abs/2404.10443
Inductive biases are crucial in disentangled representation learning for narrowing down an underspecified solution set. In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature: data compression into a grid-like latent space via quantization, collective independence amongst latents, and minimal functional influence of any latent on how other latents determine data generation. In principle, these inductive biases are deeply complementary: they most directly specify properties of the latent space, encoder, and decoder, respectively. In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits. To address this, we propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives. The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks. We also verify that Tripod significantly improves upon its naive incarnation and that all three of its "legs" are necessary for best performance.
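The first of the three biases, compressing data into a grid-like latent space via quantization, reduces in its simplest form to snapping each latent dimension to a fixed grid. This is a toy rendering; Tripod's actual quantizer is learned jointly with the autoencoder:

```python
import numpy as np

def quantize_latent(z, n_bins=5, low=-1.0, high=1.0):
    """Snap each latent dimension to the nearest of n_bins evenly
    spaced grid values. z: (d,) latent vector. Returns the quantized
    vector and the integer grid indices."""
    grid = np.linspace(low, high, n_bins)
    idx = np.abs(z[:, None] - grid[None, :]).argmin(axis=1)  # nearest grid point
    return grid[idx], idx
```

The discrete indices give the latent space its grid structure, on top of which the independence and minimal-influence biases are imposed.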
https://arxiv.org/abs/2404.10282
Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the performance of such 3D zero-shot models remains sub-optimal for real-world adaptation. In this work, we propose an optimization framework, Cross-MoST: Cross-Modal Self-Training, to improve the label-free classification performance of a zero-shot 3D vision model by simply leveraging unlabeled 3D data and their accompanying 2D views. We propose a student-teacher framework to simultaneously process 2D views and 3D point clouds and generate joint pseudo labels to train a classifier and guide cross-modal feature alignment. Thereby we demonstrate that 2D vision-language models such as CLIP can be used to complement 3D representation learning to improve classification performance without the need for expensive class annotations. Using synthetic and real-world 3D datasets, we further demonstrate that Cross-MoST enables efficient cross-modal knowledge exchange, with the image and point cloud modalities learning from each other's rich representations.
https://arxiv.org/abs/2404.10146
In this paper, we consider the problem of visual representation learning for computational pathology, by exploiting large-scale image-text pairs gathered from public resources, along with the domain-specific knowledge in pathology. Specifically, we make the following contributions: (i) We curate a pathology knowledge tree that consists of 50,470 informative attributes for 4,718 diseases requiring pathology diagnosis from 32 human tissues. To our knowledge, this is the first comprehensive structured pathology knowledge base; (ii) We develop a knowledge-enhanced visual-language pretraining approach, where we first project pathology-specific knowledge into latent embedding space via a language model, and use it to guide the visual representation learning; (iii) We conduct thorough experiments to validate the effectiveness of our proposed components, demonstrating significant performance improvement on various downstream tasks, including cross-modal retrieval, zero-shot classification on pathology patches, and zero-shot tumor subtyping on whole slide images (WSIs). All code, models, and the pathology knowledge tree will be released to the research community.
https://arxiv.org/abs/2404.09942
Message passing has become the dominant framework in graph representation learning. The essential idea of the message-passing framework is to update node embeddings based on the information aggregated from local neighbours. However, most existing aggregation methods do not encode neighbour-level message interactions into the aggregated message, resulting in information loss during embedding generation. This loss can accumulate and become more serious as more layers are added to the graph network model. To address this issue, we propose a neighbour-level message interaction information encoding method for improving graph representation learning. For the messages aggregated at a node, we explicitly generate an encoding between each message and the rest of the messages using an encoding function. We then aggregate these learned encodings and take the sum of the aggregated encoding and the aggregated message to update the embedding for the node. In this way, neighbour-level message interaction information is integrated into the generated node embeddings. The proposed encoding method is generic and can be integrated into message-passing graph convolutional networks. Extensive experiments are conducted on six popular benchmark datasets across four highly demanded tasks. The results show that integrating neighbour-level message interactions improves the performance of the base models, advancing the state of the art for representation learning over graphs.
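The aggregation described above can be sketched numerically for a single node. The tanh encoding function and its weight matrix are placeholder choices of this sketch, not the paper's learned architecture:

```python
import numpy as np

def aggregate_with_interactions(messages, w_enc):
    """Aggregate neighbour messages at one node, adding an explicit
    neighbour-level interaction encoding of each message against the
    rest. messages: (n_neighbours, d); w_enc: (2*d, d) toy weights."""
    agg_msg = messages.sum(axis=0)  # standard sum aggregation
    encodings = []
    for i in range(len(messages)):
        rest = np.delete(messages, i, axis=0).sum(axis=0)  # the other messages
        pair = np.concatenate([messages[i], rest])         # message paired with the rest
        encodings.append(np.tanh(pair @ w_enc))            # toy encoding function
    agg_enc = np.sum(encodings, axis=0)
    return agg_msg + agg_enc  # sum of aggregated encoding and aggregated message
```

With zero encoding weights the update degenerates to plain sum aggregation, which is the baseline the method augments.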
https://arxiv.org/abs/2404.09809
Studies continually find that message-passing graph convolutional networks suffer from the over-smoothing issue. Basically, over-smoothing refers to the phenomenon that the learned embeddings for all nodes become very similar to one another, and therefore uninformative, after repeatedly applying message-passing iterations. Intuitively, we can expect the generated embeddings to become asymptotically smoother layer by layer; that is, each layer of graph convolution generates a smoothed version of the embeddings compared to those generated by the previous layer. Based on this intuition, we propose RandAlign, a stochastic regularization method for graph convolutional networks. The idea of RandAlign is to randomly align the learned embedding for each node with that of the previous layer, using random interpolation in each graph convolution layer. Through alignment, the smoothness of the generated embeddings is explicitly reduced. To better maintain the benefit yielded by the graph convolution, in the alignment step we first scale the embedding of the previous layer to the same norm as the generated embedding and then perform random interpolation to align the generated embedding. RandAlign is a parameter-free method and can be directly applied without introducing additional trainable weights or hyper-parameters. We experimentally evaluate RandAlign on different graph-domain tasks on seven benchmark datasets. The experimental results show that RandAlign is a general method that improves the generalization performance of various graph convolutional network models and also improves the numerical stability of optimization, advancing the state-of-the-art performance for graph representation learning.
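The alignment step lends itself to a short sketch (a toy NumPy rendering of the description above, not the released implementation): the previous layer's embeddings are rescaled to the per-node norms of the newly generated ones, and the two are mixed with a random interpolation coefficient.

```python
import numpy as np

def randalign(h_prev, h_curr, rng, eps=1e-12):
    """One RandAlign step. h_prev, h_curr: (n_nodes, d) embeddings from
    consecutive layers; rng: a numpy Generator for the random mix."""
    prev_norm = np.linalg.norm(h_prev, axis=1, keepdims=True)
    curr_norm = np.linalg.norm(h_curr, axis=1, keepdims=True)
    h_prev_scaled = h_prev * curr_norm / (prev_norm + eps)  # match norms first
    alpha = rng.uniform(size=(h_curr.shape[0], 1))          # random mix per node
    return alpha * h_curr + (1.0 - alpha) * h_prev_scaled
```

Because only norms and a random coefficient are involved, the step is parameter-free, exactly as the abstract claims.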
https://arxiv.org/abs/2404.09774
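The alignment step described in the abstract (scale the previous layer's embedding to the norm of the newly generated one, then randomly interpolate) can be sketched as follows. This is one reading of the abstract, not the authors' reference implementation.

```python
import numpy as np

def rand_align(h_new, h_prev, rng):
    """One RandAlign step: scale the previous-layer embeddings to the
    per-node L2 norm of the new embeddings, then randomly interpolate
    between the two (a fresh random factor per forward pass)."""
    scale = (np.linalg.norm(h_new, axis=1, keepdims=True)
             / (np.linalg.norm(h_prev, axis=1, keepdims=True) + 1e-12))
    h_prev_scaled = scale * h_prev          # same norm as h_new, row-wise
    alpha = rng.uniform(0.0, 1.0)           # random interpolation factor
    return alpha * h_new + (1.0 - alpha) * h_prev_scaled
```

Because both operands of the interpolation share the same per-node norm, the aligned output's norm never exceeds that of the freshly generated embedding, while the random mixing keeps successive layers from collapsing onto one another.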
Predicting socioeconomic indicators from satellite imagery with deep learning has become an increasingly popular research direction. Post-hoc concept-based explanations can be an important step towards broader adoption of these models in policy-making as they enable the interpretation of socioeconomic outcomes based on visual concepts that are intuitive to humans. In this paper, we study the interplay between representation learning using an additional task-specific contrastive loss and post-hoc concept explainability for socioeconomic studies. Our results on two different geographical locations and tasks indicate that the task-specific pretraining imposes a continuous ordering of the latent space embeddings according to the socioeconomic outcomes. This improves the model's interpretability as it enables the latent space of the model to associate urban concepts with continuous intervals of socioeconomic outcomes. Further, we illustrate how analyzing the model's conceptual sensitivity for the intervals of socioeconomic outcomes can shed light on new insights for urban studies.
https://arxiv.org/abs/2404.09768
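As a rough illustration of what a task-specific contrastive loss for a continuous socioeconomic outcome might look like (the abstract does not give the exact loss, so the outcome-weighted form below is purely hypothetical): embeddings of samples with similar outcomes act as soft positives for each other, which encourages the continuous ordering of the latent space described above.

```python
import numpy as np

def outcome_contrastive_loss(z, y, tau=0.5, sigma=1.0):
    """Hypothetical outcome-weighted contrastive loss: pairs with close
    outcomes y get large positive weights, pulling their normalized
    embeddings z together relative to all other pairs."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                     # temperature-scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)          # exclude self-pairs
    # soft positive weights from outcome similarity (Gaussian kernel)
    w = np.exp(-((y[:, None] - y[None, :]) ** 2) / (2 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    # row-wise log-softmax over similarities (numerically stable)
    m = sim.max(axis=1, keepdims=True)
    log_p = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    np.fill_diagonal(log_p, 0.0)
    return -(w * log_p).sum() / w.sum()
```

With `sigma` shrinking, the loss approaches a hard-positive contrastive objective over near-identical outcomes; larger `sigma` smooths the supervision across the outcome range.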
When building classification systems with demographic fairness considerations, there are two objectives to satisfy: 1) maximizing utility for the specific task and 2) ensuring fairness w.r.t. a known demographic attribute. These objectives often compete, so optimizing both can lead to a trade-off between utility and fairness. While existing works acknowledge the trade-offs and study their limits, two questions remain unanswered: 1) What are the optimal trade-offs between utility and fairness? and 2) How can we numerically quantify these trade-offs from data for a desired prediction task and demographic attribute of interest? This paper addresses these questions. We introduce two utility-fairness trade-offs: the Data-Space and Label-Space Trade-off. The trade-offs reveal three regions within the utility-fairness plane, delineating what is fully and partially possible and impossible. We propose U-FaTE, a method to numerically quantify the trade-offs for a given prediction task and group fairness definition from data samples. Based on the trade-offs, we introduce a new scheme for evaluating representations. An extensive evaluation of fair representation learning methods and representations from over 1000 pre-trained models revealed that most current approaches are far from the estimated and achievable fairness-utility trade-offs across multiple datasets and prediction tasks.
https://arxiv.org/abs/2404.09454
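For intuition, a single classifier's position on the utility-fairness plane can be measured as below. `utility_fairness_point` is a toy helper using accuracy as utility and the demographic parity gap as one common group-fairness violation; it is not the U-FaTE estimator itself.

```python
import numpy as np

def utility_fairness_point(y_true, y_pred, group):
    """Place one classifier on the utility-fairness plane:
    utility = accuracy; fairness violation = demographic parity gap,
    i.e. the largest difference in positive-prediction rate between
    any two demographic groups."""
    utility = float(np.mean(y_pred == y_true))
    rates = [np.mean(y_pred[group == g]) for g in np.unique(group)]
    dp_gap = float(max(rates) - min(rates))
    return utility, dp_gap
```

Sweeping many models (or one model under different fairness constraints) through this helper traces an empirical utility-fairness curve that can then be compared against an estimated achievable frontier.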