In this paper, we present a novel approach termed Prompt-Driven Feature Diffusion (PDFD) within a semi-supervised learning framework for Open World Semi-Supervised Learning (OW-SSL). At its core, PDFD deploys an efficient feature-level diffusion model with the guidance of class-specific prompts to support discriminative feature representation learning and feature generation, tackling the challenge of the unavailability of labeled data for unseen classes in OW-SSL. In particular, PDFD utilizes class prototypes as prompts in the diffusion model, leveraging their class-discriminative and semantic generalization ability to condition and guide the diffusion process across all the seen and unseen classes. Furthermore, PDFD incorporates a class-conditional adversarial loss for diffusion model training, ensuring that the features generated via the diffusion process are discriminatively aligned with the class-conditional features of the real data. Additionally, the class prototypes of the unseen classes are computed using only unlabeled instances with confident predictions within the semi-supervised learning framework. We conduct extensive experiments to evaluate the proposed PDFD. The empirical results show that PDFD achieves remarkable performance improvements over many existing state-of-the-art methods.
https://arxiv.org/abs/2404.11795
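The prototype-computation step for unseen classes described above can be sketched minimally as follows (the function name, the confidence threshold value, and the mean-pooling choice are illustrative assumptions, not details from the paper):

```python
import numpy as np

def confident_prototypes(features, probs, threshold=0.95):
    """Compute per-class prototypes from unlabeled features, keeping only
    instances whose maximum predicted probability exceeds the threshold.
    features: (N, D) array; probs: (N, C) softmax outputs."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    keep = conf >= threshold
    num_classes = probs.shape[1]
    protos = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        mask = keep & (preds == c)
        if mask.any():
            protos[c] = features[mask].mean(axis=0)  # mean of confident members
    return protos
```

In the paper's setting these prototypes would then condition the diffusion model as prompts; low-confidence instances simply do not contribute.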
This paper focuses on reducing the communication cost of federated learning by exploring generalization bounds and representation learning. We first characterize a tighter generalization bound for one-round federated learning based on the local clients' generalization and the heterogeneity of the data distribution (non-iid scenario). We also characterize a generalization bound for R-round federated learning and its relation to the number of local updates (local stochastic gradient descent (SGD) steps). Then, based on our generalization bound analysis and its representation learning interpretation, we show for the first time that less frequent aggregations, and hence more local updates, for the representation extractor (usually corresponding to the initial layers) lead to more generalizable models, particularly in non-iid scenarios. We design a novel Federated Learning with Adaptive Local Steps (FedALS) algorithm based on our generalization bound and representation learning analysis. FedALS employs varying aggregation frequencies for different parts of the model, thereby reducing the communication cost. We conclude with experimental results showing the effectiveness of FedALS.
https://arxiv.org/abs/2404.11754
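The varying aggregation frequencies in FedALS can be sketched as a simple schedule; the `alpha` multiplier and function names below are assumptions for illustration, whereas the paper derives its actual schedule from the generalization bound:

```python
def should_aggregate(step, base_period, layer_is_extractor, alpha=4):
    """Schedule sketch: the representation extractor (initial layers) is
    aggregated `alpha` times less frequently than the rest of the model,
    i.e. it runs more local SGD steps between aggregations."""
    period = base_period * alpha if layer_is_extractor else base_period
    return step % period == 0
```

Head layers then synchronize every `base_period` local steps, while extractor layers synchronize only every `base_period * alpha` steps, cutting communication for the bulk of the parameters.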
Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information of geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework by considering both spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flipping, rotation, and scene flow. For spatial augmentations, we find that depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields the best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We evaluate our pre-training method on 3D object detection, where it outperforms existing equivariant and invariant approaches in many settings.
https://arxiv.org/abs/2404.11737
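The equivariance pre-training idea can be illustrated with a toy rotation-equivariance penalty on 2D point features; all names here are hypothetical, and the paper's objectives operate on learned LiDAR features rather than raw coordinates:

```python
import numpy as np

def rotation_equivariance_loss(encode, points, angle):
    """Equivariance sketch: encoding rotated points should match rotating
    the encoding of the original points. `encode` maps (N, 2) -> (N, 2)."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    f_rot = encode(points @ R.T)   # encode the transformed input
    rot_f = encode(points) @ R.T   # transform the encoded output
    return float(np.mean((f_rot - rot_f) ** 2))
```

A linear map is exactly rotation-equivariant and incurs zero loss; an encoder with a constant offset is not, so the penalty pushes features toward preserving the geometric relationship.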
Change detection aims to identify remote sensing object changes by analyzing data between bitemporal image pairs. Due to the large temporal and spatial span of data collection in change detection image pairs, there is often a significant amount of task-specific and task-agnostic noise. Previous efforts have focused excessively on denoising, at the cost of losing a great deal of fine-grained information. In this paper, we revisit the importance of fine-grained features in change detection and propose a series of operations for fine-grained information compensation and noise decoupling (FINO). First, the context is utilized to compensate for the fine-grained information in the feature space. Next, a shape-aware and a brightness-aware module are designed to improve the capacity for representation learning. The shape-aware module guides the backbone network toward more precise shape estimation and the extraction of object shape features. The brightness-aware module learns an overall brightness estimation to improve the model's robustness to task-agnostic noise. Finally, a task-specific noise decoupling structure is designed to improve the model's ability to separate noise interference from feature similarity. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results on multiple change detection benchmarks. The code will be made available.
https://arxiv.org/abs/2404.11318
Time series anomaly detection (TAD) faces a significant challenge due to the scarcity of labelled data, which hinders the development of accurate detection models. Unsupervised domain adaptation (UDA) addresses this challenge by leveraging a labelled dataset from a related domain to detect anomalies in a target dataset. Existing domain adaptation techniques assume that the number of anomalous classes does not change between the source and target domains. In this paper, we propose a novel Domain Adaptation Contrastive learning for Anomaly Detection in multivariate time series (DACAD) model to address this issue by combining UDA and contrastive representation learning. DACAD's approach includes an anomaly injection mechanism that introduces various types of synthetic anomalies, enhancing the model's ability to generalise across unseen anomalous classes in different domains. This method significantly broadens the model's adaptability and robustness. Additionally, we propose a supervised contrastive loss for the source domain and a self-supervised contrastive triplet loss for the target domain, improving comprehensive feature representation learning and extraction of domain-invariant features. Finally, an effective Centre-based Entropy Classifier (CEC) is proposed specifically for anomaly detection, facilitating accurate learning of normal boundaries in the source domain. Our extensive evaluation across multiple real-world datasets against leading models in time series anomaly detection and UDA underscores DACAD's effectiveness. The results validate DACAD's superiority in transferring knowledge across domains and its potential to mitigate the challenge of limited labelled data in time series anomaly detection.
https://arxiv.org/abs/2404.11269
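The anomaly injection mechanism can be sketched roughly as follows; the anomaly types and magnitudes are illustrative assumptions, not the paper's exact taxonomy:

```python
import numpy as np

def inject_anomalies(series, rng, kind="spike"):
    """Anomaly-injection sketch: create a synthetic anomaly of a chosen
    type in a copy of a univariate series. Names are illustrative."""
    s = series.copy()
    n = len(s)
    start = rng.integers(0, n - 5)
    if kind == "spike":
        s[start] += 10 * s.std()                               # point anomaly
    elif kind == "drift":
        s[start:] += np.linspace(0, 5 * s.std(), n - start)    # gradual shift
    elif kind == "noise":
        s[start:start + 5] += rng.normal(0, 3 * s.std(), 5)    # noisy segment
    return s
```

Training on such synthetic anomalies, alongside the contrastive losses, is what lets the model generalize to anomalous classes it has never observed in the target domain.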
ICD (International Classification of Diseases) coding involves assigning ICD codes to patient visits based on their medical notes. ICD coding is a challenging multi-label text classification problem due to noisy medical document inputs. Recent advancements in automated ICD coding have enhanced performance by integrating additional data and knowledge bases with the encoding of medical notes and codes. However, most of them ignore the code hierarchy, leading to improper code assignments. To address these problems, we propose a novel framework based on associated and hierarchical code description distillation (AHDD) for better code representation learning and avoidance of improper code assignment. In this paper, we leverage the code description and the hierarchical structure inherent to the ICD codes. The code description is also applied to inform the attention layer and the output layer. Experimental results on the benchmark dataset show the superiority of the proposed framework over several state-of-the-art baselines.
https://arxiv.org/abs/2404.11132
In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose, for the first time, a novel dynamic self-adaptive multiscale distillation from a pre-trained multimodal large model for efficient cross-modal representation learning. Unlike existing distillation methods, our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model and ensuring that the student model inherits a comprehensive and nuanced understanding of the teacher's knowledge. To optimize each distillation loss in a balanced and efficient manner, we propose a dynamic self-adaptive distillation loss balancer, a novel component that eliminates the need for manual loss weight adjustments and dynamically balances each loss term during the distillation process. Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources. This efficient approach is suited for various applications and allows the deployment of advanced multimodal technologies even in resource-limited settings. Extensive experiments have demonstrated that our method maintains high performance while significantly reducing model complexity and training costs. Moreover, our distilled student model utilizes only image-level information to achieve state-of-the-art performance on cross-modal retrieval tasks, surpassing previous methods that relied on region-level information.
https://arxiv.org/abs/2404.10838
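A minimal sketch of a dynamic loss balancer in the spirit described above, assuming a simple inverse-magnitude weighting; the paper's exact balancing rule may differ:

```python
import numpy as np

def balance_losses(loss_values, eps=1e-8):
    """Dynamic loss-balancer sketch: weight each loss term inversely to
    its magnitude so every scale contributes comparably, then renormalize
    the weights to sum to the number of terms. Illustrative scheme, not
    the paper's exact formulation."""
    losses = np.asarray(loss_values, dtype=float)
    w = 1.0 / (losses + eps)            # larger losses get smaller weights
    w = w * len(losses) / w.sum()       # keep the average weight at 1
    return float((w * losses).sum()), w
```

The appeal of such a scheme is that no manual loss-weight hyper-parameters are needed: the weights adapt every step to whatever scale each distillation term currently has.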
Recently, heterogeneous graph neural networks (HGNNs) have achieved impressive success in representation learning by capturing long-range dependencies and heterogeneity at the node level. However, few existing studies have delved into the utilization of node attributes in heterogeneous information networks (HINs). In this paper, we investigate the impact of inter-node attribute disparities on HGNN performance within the benchmark task, i.e., node classification, and empirically find that typical models exhibit a significant performance decline when classifying nodes whose attributes markedly differ from those of their neighbors. To alleviate this issue, we propose a novel Attribute-Guided heterogeneous Information Networks representation learning model with Transformer (AGHINT), which allows a more effective aggregation of neighbor node information under the guidance of attributes. Specifically, AGHINT transcends the constraints of the original graph structure by directly integrating higher-order similar-neighbor features into the learning process and modifies the message-passing mechanism between nodes based on their attribute disparities. Extensive experimental results on three real-world heterogeneous graph benchmarks with target node attributes demonstrate that AGHINT outperforms state-of-the-art methods.
https://arxiv.org/abs/2404.10443
Inductive biases are crucial in disentangled representation learning for narrowing down an underspecified solution set. In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature: data compression into a grid-like latent space via quantization, collective independence amongst latents, and minimal functional influence of any latent on how other latents determine data generation. In principle, these inductive biases are deeply complementary: they most directly specify properties of the latent space, encoder, and decoder, respectively. In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits. To address this, we propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives. The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks. We also verify that Tripod significantly improves upon its naive incarnation and that all three of its "legs" are necessary for best performance.
https://arxiv.org/abs/2404.10282
Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the performance of such 3D zero-shot models remains sub-optimal for real-world adaptation. In this work, we propose an optimization framework, Cross-MoST: Cross-Modal Self-Training, to improve the label-free classification performance of a zero-shot 3D vision model by simply leveraging unlabeled 3D data and their accompanying 2D views. We propose a student-teacher framework to simultaneously process 2D views and 3D point clouds and generate joint pseudo labels to train a classifier and guide cross-modal feature alignment. Thereby we demonstrate that 2D vision-language models such as CLIP can be used to complement 3D representation learning to improve classification performance without the need for expensive class annotations. Using synthetic and real-world 3D datasets, we further demonstrate that Cross-MoST enables efficient cross-modal knowledge exchange, with the image and point cloud modalities learning from each other's rich representations.
https://arxiv.org/abs/2404.10146
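The joint pseudo-labeling step can be sketched as averaging the two branches' softmax predictions and keeping only confident ones; the threshold value and function names are assumptions, not the paper's settings:

```python
import numpy as np

def joint_pseudo_labels(logits_2d, logits_3d, threshold=0.7):
    """Joint pseudo-label sketch: average softmax predictions from the 2D
    and 3D branches, take the argmax as the label, and keep only samples
    where the averaged confidence clears the threshold."""
    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    p = 0.5 * (softmax(logits_2d) + softmax(logits_3d))
    labels = p.argmax(axis=1)
    mask = p.max(axis=1) >= threshold
    return labels, mask
```

When the two modalities disagree, the averaged confidence drops below the threshold and the sample is simply excluded from the pseudo-labeled training set, which is one way cross-modal agreement can act as a filter.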
In this paper, we consider the problem of visual representation learning for computational pathology, by exploiting large-scale image-text pairs gathered from public resources, along with the domain specific knowledge in pathology. Specifically, we make the following contributions: (i) We curate a pathology knowledge tree that consists of 50,470 informative attributes for 4,718 diseases requiring pathology diagnosis from 32 human tissues. To our knowledge, this is the first comprehensive structured pathology knowledge base; (ii) We develop a knowledge-enhanced visual-language pretraining approach, where we first project pathology-specific knowledge into a latent embedding space via a language model, and use it to guide the visual representation learning; (iii) We conduct thorough experiments to validate the effectiveness of our proposed components, demonstrating significant performance improvement on various downstream tasks, including cross-modal retrieval, zero-shot classification on pathology patches, and zero-shot tumor subtyping on whole slide images (WSIs). All code, models, and the pathology knowledge tree will be released to the research community.
https://arxiv.org/abs/2404.09942
Message passing has become the dominant framework in graph representation learning. The essential idea of the message-passing framework is to update node embeddings based on the information aggregated from local neighbours. However, most existing aggregation methods do not encode neighbour-level message interactions into the aggregated message, resulting in information loss during embedding generation. This information loss can accumulate and become more severe as more layers are added to the graph network model. To address this issue, we propose a neighbour-level message interaction information encoding method for improving graph representation learning. For the messages aggregated at a node, we explicitly generate an encoding between each message and the remaining messages using an encoding function. We then aggregate these learned encodings and take the sum of the aggregated encoding and the aggregated message to update the embedding for the node. In this way, neighbour-level message interaction information is integrated into the generated node embeddings. The proposed encoding method is generic and can be integrated into message-passing graph convolutional networks. Extensive experiments are conducted on six popular benchmark datasets across four highly demanded tasks. The results show that integrating neighbour-level message interactions improves the performance of the base models, advancing the state-of-the-art results for representation learning over graphs.
https://arxiv.org/abs/2404.09809
Studies continually find that message-passing graph convolutional networks suffer from the over-smoothing issue. Basically, over-smoothing refers to the phenomenon that the learned embeddings for all nodes become very similar to one another, and therefore uninformative, after repeatedly applying message-passing iterations. Intuitively, we can expect the generated embeddings to become smooth asymptotically, layer by layer; that is, each layer of graph convolution generates a smoothed version of the embeddings compared to those generated by the previous layer. Based on this intuition, we propose RandAlign, a stochastic regularization method for graph convolutional networks. The idea of RandAlign is to randomly align the learned embedding for each node with that of the previous layer using random interpolation in each graph convolution layer. Through alignment, the smoothness of the generated embeddings is explicitly reduced. To better maintain the benefit yielded by the graph convolution, in the alignment step we first scale the embedding of the previous layer to the same norm as the generated embedding and then perform random interpolation for aligning the generated embedding. RandAlign is a parameter-free method and can be directly applied without introducing additional trainable weights or hyper-parameters. We experimentally evaluate RandAlign on different graph domain tasks on seven benchmark datasets. The experimental results show that RandAlign is a general method that improves the generalization performance of various graph convolutional network models and also improves the numerical stability of optimization, advancing the state-of-the-art performance for graph representation learning.
https://arxiv.org/abs/2404.09774
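The alignment step can be sketched directly from the description above (rescale the previous layer's embedding to the current norm, then randomly interpolate); the function signature is illustrative:

```python
import numpy as np

def rand_align(h_curr, h_prev, rng):
    """RandAlign sketch: per node, rescale the previous layer's embedding
    to the norm of the current one, then randomly interpolate between the
    two, explicitly reducing over-smoothing. Parameter-free apart from
    the random draw. h_curr, h_prev: (N, D) arrays."""
    norms_c = np.linalg.norm(h_curr, axis=1, keepdims=True)
    norms_p = np.linalg.norm(h_prev, axis=1, keepdims=True) + 1e-12
    h_prev_scaled = h_prev * norms_c / norms_p   # match current norms
    lam = rng.uniform(0.0, 1.0)                  # random mixing coefficient
    return lam * h_curr + (1.0 - lam) * h_prev_scaled
```

Because the previous embedding is rescaled first, the interpolation perturbs direction (smoothness) rather than magnitude, which is the stated reason for the norm-matching step.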
Predicting socioeconomic indicators from satellite imagery with deep learning has become an increasingly popular research direction. Post-hoc concept-based explanations can be an important step towards broader adoption of these models in policy-making as they enable the interpretation of socioeconomic outcomes based on visual concepts that are intuitive to humans. In this paper, we study the interplay between representation learning using an additional task-specific contrastive loss and post-hoc concept explainability for socioeconomic studies. Our results on two different geographical locations and tasks indicate that the task-specific pretraining imposes a continuous ordering of the latent space embeddings according to the socioeconomic outcomes. This improves the model's interpretability as it enables the latent space of the model to associate urban concepts with continuous intervals of socioeconomic outcomes. Further, we illustrate how analyzing the model's conceptual sensitivity for the intervals of socioeconomic outcomes can shed light on new insights for urban studies.
https://arxiv.org/abs/2404.09768
When building classification systems with demographic fairness considerations, there are two objectives to satisfy: 1) maximizing utility for the specific task and 2) ensuring fairness w.r.t. a known demographic attribute. These objectives often compete, so optimizing both can lead to a trade-off between utility and fairness. While existing works acknowledge the trade-offs and study their limits, two questions remain unanswered: 1) What are the optimal trade-offs between utility and fairness? and 2) How can we numerically quantify these trade-offs from data for a desired prediction task and demographic attribute of interest? This paper addresses these questions. We introduce two utility-fairness trade-offs: the Data-Space and Label-Space Trade-off. The trade-offs reveal three regions within the utility-fairness plane, delineating what is fully and partially possible and impossible. We propose U-FaTE, a method to numerically quantify the trade-offs for a given prediction task and group fairness definition from data samples. Based on the trade-offs, we introduce a new scheme for evaluating representations. An extensive evaluation of fair representation learning methods and representations from over 1000 pre-trained models revealed that most current approaches are far from the estimated and achievable fairness-utility trade-offs across multiple datasets and prediction tasks.
https://arxiv.org/abs/2404.09454
We propose a graph-based representation learning framework for video summarization. First, we convert an input video to a graph where nodes correspond to each of the video frames. Then, we impose sparsity on the graph by connecting only those pairs of nodes that are within a specified temporal distance. We then formulate the video summarization task as a binary node classification problem, classifying whether each video frame should belong to the output summary video. A graph constructed this way aims to capture long-range interactions among video frames, and the sparsity ensures the model trains without hitting the memory and compute bottleneck. Experiments on two datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed nimble model compared to existing state-of-the-art summarization approaches, while being one order of magnitude more efficient in compute time and memory.
https://arxiv.org/abs/2404.10539
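The sparse graph construction can be sketched as a banded adjacency matrix over frame indices:

```python
import numpy as np

def sparse_temporal_adjacency(num_frames, max_dist):
    """Sparse temporal graph sketch: nodes are video frames; connect
    pairs of frames whose indices differ by at most `max_dist`
    (no self-loops). Returns a (num_frames, num_frames) 0/1 matrix."""
    idx = np.arange(num_frames)
    diff = np.abs(idx[:, None] - idx[None, :])
    adj = (diff > 0) & (diff <= max_dist)
    return adj.astype(int)
```

The number of edges grows linearly with the number of frames rather than quadratically, which is what keeps memory and compute bounded for long videos before the binary node classifier is applied.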
Deep clustering, as an important branch of unsupervised representation learning, focuses on embedding semantically similar samples into the identical feature space. This core demand inspires the exploration of contrastive learning and subspace clustering. However, these solutions rely on the basic assumption that there are sufficient and category-balanced samples for generating valid high-level representations. This hypothesis is actually too strict to be satisfied in real-world applications. To overcome such a challenge, the natural strategy is to utilize generative models to augment a considerable number of instances. However, how to use these novel samples to effectively improve clustering performance remains difficult and under-explored. In this paper, we propose a novel Generative Calibration Clustering (GCC) method to delicately incorporate feature learning and augmentation into the clustering procedure. First, we develop a discriminative feature alignment mechanism to discover the intrinsic relationship across real and generated samples. Second, we design a self-supervised metric learning to generate more reliable cluster assignments to boost the conditional diffusion generation. Extensive experimental results on three benchmarks validate the effectiveness and advantage of our proposed method over state-of-the-art methods.
https://arxiv.org/abs/2404.09115
New Intent Discovery (NID) strives to identify known intents and reasonably deduce novel intent groups in the open-world scenario. However, current methods face issues with inaccurate pseudo-labels and poor representation learning, creating a negative feedback loop that degrades overall model performance, including accuracy and the adjusted Rand index. To address the aforementioned challenges, we propose a Robust New Intent Discovery (RoNID) framework optimized by an EM-style method, which focuses on constructing reliable pseudo-labels and obtaining cluster-friendly discriminative representations. RoNID comprises two main modules: a reliable pseudo-label generation module and a cluster-friendly representation learning module. Specifically, the pseudo-label generation module assigns reliable synthetic labels by solving an optimal transport problem in the E-step, which effectively provides high-quality supervised signals for the input of the cluster-friendly representation learning module. To learn cluster-friendly representations with strong intra-cluster compactness and large inter-cluster separation, the representation learning module combines intra-cluster and inter-cluster contrastive learning in the M-step to feed more discriminative features into the generation module. RoNID can be performed iteratively to ultimately yield a robust model with reliable pseudo-labels and cluster-friendly representations. Experimental results on multiple benchmarks demonstrate that our method brings substantial improvements over previous state-of-the-art methods, by a large margin of +1 to +4 points.
https://arxiv.org/abs/2404.08977
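The E-step's balanced assignment can be sketched with Sinkhorn-style normalization, an illustrative stand-in for the paper's optimal-transport solver:

```python
import numpy as np

def sinkhorn_pseudo_labels(scores, n_iter=50):
    """E-step sketch: Sinkhorn-style balancing of a sample-by-cluster
    score matrix so the soft assignment has (approximately) uniform
    marginals over samples and clusters, then hard pseudo-labels by
    argmax. scores: (N, C) array of affinities."""
    Q = np.exp(scores)
    for _ in range(n_iter):
        Q /= Q.sum(axis=0, keepdims=True)   # balance cluster marginals
        Q /= Q.sum(axis=1, keepdims=True)   # balance sample marginals
    return Q.argmax(axis=1)
```

The balancing keeps any single cluster from absorbing most samples, which is one way an optimal-transport formulation produces more reliable pseudo-labels than a plain argmax over raw scores.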
A significant amount of society's infrastructure can be modeled using graph structures, from electric and communication grids, to traffic networks, to social networks. Each of these domains is also susceptible to the cascading spread of negative impacts, whether this be overloaded devices in the power grid or the reach of a social media post containing misinformation. The potential harm of a cascade is compounded when considering a malicious attack by an adversary that is intended to maximize the cascading impact. However, by exploiting knowledge of the cascading dynamics, targets with the largest cascading impact can be preemptively prioritized for defense, and the damage an adversary can inflict can be mitigated. While game theory provides tools for finding an optimal preemptive defense strategy, existing methods struggle to scale to the context of large graph environments because of the combinatorial explosion of possible actions that occurs when the attacker and defender can each choose multiple targets in the graph simultaneously. The proposed method enables a data-driven deep learning approach that uses multi-node representation learning and counterfactual data augmentation to generalize to the full combinatorial action space by training on a variety of small restricted subsets of the action space. We demonstrate through experiments that the proposed method is capable of identifying defense strategies that are less exploitable than SOTA methods for large graphs, while still being able to produce strategies near the Nash equilibrium for small-scale scenarios for which it can be computed. Moreover, the proposed method demonstrates superior prediction accuracy on a validation set of unseen cascades compared to other deep learning approaches.
https://arxiv.org/abs/2404.14418
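The abstract does not spell out the cascading dynamics it assumes, but a standard choice such as the deterministic linear threshold model makes the notion of an attacked target's "cascading impact" concrete. The function name and the toy graph below are hypothetical illustrations, not the paper's setup.

```python
from collections import deque

def cascade_impact(adj, thresholds, seeds):
    """Deterministic linear-threshold cascade: a node fails once the
    fraction of its failed neighbors reaches its threshold.
    Returns the set of all failed nodes, seeds included."""
    failed = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v in failed:
                continue
            frac = sum(n in failed for n in adj[v]) / len(adj[v])
            if frac >= thresholds[v]:
                failed.add(v)
                frontier.append(v)
    return failed

# Tiny 4-node path graph: attacking node 0 cascades down the line.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
thresholds = {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5}
impact = cascade_impact(adj, thresholds, seeds=[0])  # all four nodes fail
```

A defender scoring candidate targets by such an impact function is the kind of evaluation the game-theoretic setting requires at scale, which is where the combinatorial action space becomes the bottleneck.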
To make sense of their surroundings, intelligent systems must transform complex sensory inputs to structured codes that are reduced to task-relevant information such as object category. Biological agents achieve this in a largely autonomous manner, presumably via self-supervised learning. Whereas previous attempts to model the underlying mechanisms were largely discriminative in nature, there is ample evidence that the brain employs a generative model of the world. Here, we propose that eye movements, in combination with the focused nature of primate vision, constitute a generative, self-supervised task of predicting and revealing visual information. We construct a proof-of-principle model starting from the framework of masked image modeling (MIM), a common approach in deep representation learning. To do so, we analyze how core components of MIM such as masking technique and data augmentation influence the formation of category-specific representations. This allows us not only to better understand the principles behind MIM, but to then reassemble a MIM more in line with the focused nature of biological perception. From a theoretical angle, we find that MIM disentangles neurons in latent space, a property that has been suggested to structure visual representations in primates, without explicit regulation. Together with previous findings of invariance learning, this highlights an interesting connection of MIM to latent regularization approaches for self-supervised learning. The source code is available under this https URL
https://arxiv.org/abs/2404.08526
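The masking component of MIM that the abstract analyzes can be sketched as random patch masking in the style of masked autoencoders: the image is split into non-overlapping patches and a fixed fraction is hidden for the model to predict. `mask_patches`, the patch size, and the mask ratio below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def mask_patches(image, patch=4, mask_ratio=0.75, rng=None):
    """Split a square grayscale image into non-overlapping patches and
    zero out a random subset, returning the masked image and the
    boolean patch mask (True = hidden, i.e. to be predicted)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape
    gh, gw = h // patch, w // patch
    n = gh * gw
    hidden = np.zeros(n, dtype=bool)
    hidden[rng.choice(n, size=int(n * mask_ratio), replace=False)] = True
    masked = image.copy()
    for idx in np.flatnonzero(hidden):
        r, c = divmod(idx, gw)
        masked[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return masked, hidden.reshape(gh, gw)

# 16x16 image of ones -> 4x4 grid of patches, 12 of 16 hidden.
img = np.ones((16, 16))
masked, hidden = mask_patches(img, rng=np.random.default_rng(0))
```

A fixation-centered variant, closer to the paper's proposal, would replace the uniform random choice with masking everything outside a foveated region around the simulated gaze point.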