Imaging sites around the world generate growing amounts of medical scan data with ever more versatile and affordable technology. Large-scale studies acquire MRI for tens of thousands of participants, together with metadata ranging from lifestyle questionnaires to biochemical assays, genetic analyses and more. These large datasets encode substantial information about human health and hold considerable potential for machine learning training and analysis. This chapter examines ongoing large-scale studies and the challenge of distribution shifts between them. Transfer learning for overcoming such shifts is discussed, together with federated learning for safe access to distributed training data securely held at multiple institutions. Finally, representation learning is reviewed as a methodology for encoding embeddings that express abstract relationships in multi-modal input formats.
https://arxiv.org/abs/2404.14326
Trajectory modeling refers to characterizing human movement behavior, serving as a pivotal step in understanding mobility patterns. Nevertheless, existing studies typically ignore the confounding effects of geospatial context, leading to the acquisition of spurious correlations and limited generalization capabilities. To bridge this gap, we initially formulate a Structural Causal Model (SCM) to decipher the trajectory representation learning process from a causal perspective. Building upon the SCM, we further present a Trajectory modeling framework (TrajCL) based on Causal Learning, which leverages the backdoor adjustment theory as an intervention tool to eliminate the spurious correlations between geospatial context and trajectories. Extensive experiments on two real-world datasets verify that TrajCL markedly enhances performance in trajectory classification tasks while showcasing superior generalization and interpretability.
https://arxiv.org/abs/2404.14073
Most works studying representation learning focus only on classification and neglect regression. Yet, the learning objectives, and therefore the representation topologies, of the two tasks are fundamentally different: classification targets class separation, leading to disconnected representations, whereas regression requires ordinality with respect to the target, leading to continuous representations. We therefore ask how the effectiveness of a regression representation is influenced by its topology, with evaluation based on the Information Bottleneck (IB) principle. The IB principle is an important framework that provides guidance for learning effective representations. We establish two connections between it and the topology of regression representations. The first connection reveals that a lower intrinsic dimension of the feature space implies a reduced complexity of the representation Z. This complexity can be quantified as the conditional entropy of Z given the target space Y, and serves as an upper bound on the generalization error. The second connection suggests that learning a feature space topologically similar to the target space aligns better with the IB principle. Based on these two connections, we introduce PH-Reg, a regularizer specific to regression that matches the intrinsic dimension and topology of the feature space with those of the target space. Experiments on synthetic and real-world regression tasks demonstrate the benefits of PH-Reg.
https://arxiv.org/abs/2404.13904
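PH-Reg itself is built on persistent homology; as a much simpler stand-in for the same idea, the sketch below penalizes mismatch between the pairwise-distance structure of the feature space and that of the target space, so that a representation ordered like the target incurs a smaller penalty. All names and the loss form are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def distance_matrix(x):
    """Pairwise Euclidean distances between rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def topology_align_loss(features, targets):
    """Penalize mismatch between feature-space and target-space
    pairwise-distance structure (both normalized to unit scale)."""
    df = distance_matrix(features)
    dt = distance_matrix(targets)
    df = df / (df.max() + 1e-12)
    dt = dt / (dt.max() + 1e-12)
    return float(((df - dt) ** 2).mean())

rng = np.random.default_rng(0)
y = rng.normal(size=(32, 1))                               # regression targets
z_good = np.hstack([y, 0.01 * rng.normal(size=(32, 1))])   # ordinal w.r.t. y
z_bad = rng.normal(size=(32, 2))                           # unrelated features
loss_good = topology_align_loss(z_good, y)
loss_bad = topology_align_loss(z_bad, y)
```

A feature space that mirrors the target's continuous, ordered geometry scores near zero; an arbitrary one does not.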
Recent techniques on implicit geometry representation learning and neural rendering have shown promising results for 3D clothed human reconstruction from sparse video inputs. However, it is still challenging to reconstruct detailed surface geometry and even more difficult to synthesize photorealistic novel views with animated human poses. In this work, we introduce PGAHum, a prior-guided geometry and appearance learning framework for high-fidelity animatable human reconstruction. We thoroughly exploit 3D human priors in three key modules of PGAHum to achieve high-quality geometry reconstruction with intricate details and photorealistic view synthesis on unseen poses. First, a prior-based implicit geometry representation of 3D human, which contains a delta SDF predicted by a tri-plane network and a base SDF derived from the prior SMPL model, is proposed to model the surface details and the body shape in a disentangled manner. Second, we introduce a novel prior-guided sampling strategy that fully leverages the prior information of the human pose and body to sample the query points within or near the body surface. By avoiding unnecessary learning in the empty 3D space, the neural rendering can recover more appearance details. Last, we propose a novel iterative backward deformation strategy to progressively find the correspondence for the query point in observation space. A skinning weights prediction model is learned based on the prior provided by the SMPL model to achieve the iterative backward LBS deformation. Extensive quantitative and qualitative comparisons on various datasets are conducted and the results demonstrate the superiority of our framework. Ablation studies also verify the effectiveness of each scheme for geometry and appearance learning.
https://arxiv.org/abs/2404.13862
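The disentangled geometry representation in PGAHum sums a base SDF from the SMPL prior with a learned delta SDF for surface detail. A toy analogue, with a sphere standing in for the SMPL-derived base and a hand-written bump standing in for the tri-plane-predicted delta (both are illustrative, not the paper's networks):

```python
import numpy as np

def base_sdf_sphere(p, radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(p, axis=-1) - radius

def delta_sdf(p):
    """Stand-in for the learned tri-plane delta: a small smooth bump."""
    return 0.05 * np.sin(3.0 * p[..., 0])

def composite_sdf(p):
    """Disentangled model: coarse body shape (base) + surface detail (delta)."""
    return base_sdf_sphere(p) + delta_sdf(p)

pts = np.array([[0.0, 0.0, 0.0],   # origin: inside the base sphere
                [2.0, 0.0, 0.0]])  # far point: outside the base sphere
vals = composite_sdf(pts)
```

Keeping the delta small relative to the base preserves the prior's overall body shape while letting the learned term model fine geometry.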
Conditional independence (CI) constraints are critical for defining and evaluating fairness in machine learning, as well as for learning unconfounded or causal representations. Traditional methods for ensuring fairness either blindly learn invariant features with respect to a protected variable (e.g., race when classifying sex from face images) or enforce CI relative to the protected attribute only on the model output (e.g., the sex label). Neither of these methods is effective in enforcing CI in high-dimensional feature spaces. In this paper, we focus on a nascent approach characterizing the CI constraint in terms of two Jensen-Shannon divergence terms, and we extend it to high-dimensional feature spaces using a novel dynamic sampling strategy. In doing so, we introduce a new training paradigm that can be applied to any encoder architecture. We are able to enforce conditional independence of the diffusion autoencoder latent representation with respect to any protected attribute under the equalized odds constraint and show that this approach enables causal image generation with controllable latent spaces. Our experimental results demonstrate that our approach can achieve high accuracy on downstream tasks while upholding equality of odds.
https://arxiv.org/abs/2404.13798
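The CI constraint above is characterized via two Jensen-Shannon divergence terms; the sketch below only shows the JS divergence itself on toy discrete distributions, to make its key properties (symmetry, boundedness by log 2, zero iff identical) concrete:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.9, 0.1])
q = np.array([0.1, 0.9])
d = js_divergence(p, q)        # large: the distributions differ
zero = js_divergence(p, p)     # 0: identical distributions
```

Driving such divergence terms to zero, conditioned on the label, is what enforces equalized odds in the high-dimensional latent space.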
Graph neural networks (GNNs) have revolutionized the field of machine learning on non-Euclidean data such as graphs and networks. GNNs effectively implement node representation learning through neighborhood aggregation and achieve impressive results in many graph-related tasks. However, most neighborhood aggregation approaches are summation-based, which can be problematic as they may not be sufficiently expressive to encode informative graph structures. Furthermore, though the graph pooling module is also of vital importance for graph learning, especially for the task of graph classification, research on graph down-sampling mechanisms is rather limited. To address the above challenges, we propose a concatenation-based graph convolution mechanism that injectively updates node representations to maximize the discriminative power in distinguishing non-isomorphic subgraphs. In addition, we design a novel graph pooling module, called WL-SortPool, to learn important subgraph patterns in a deep-learning manner. WL-SortPool layer-wise sorts node representations (i.e. continuous WL colors) to separately learn the relative importance of subtrees with different depths for the purpose of classification, thus better characterizing the complex graph topology and rich information encoded in the graph. We propose a novel Subgraph Pattern GNN (SPGNN) architecture that incorporates these enhancements. We test the proposed SPGNN architecture on many graph classification benchmarks. Experimental results show that our method can achieve highly competitive results with state-of-the-art graph kernels and other GNN approaches.
https://arxiv.org/abs/2404.13655
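The "continuous WL colors" that WL-SortPool sorts generalize classic Weisfeiler-Leman color refinement, which can be sketched in its discrete form (a textbook illustration, not the SPGNN layer itself):

```python
def wl_refine(adj, colors, rounds=2):
    """Discrete Weisfeiler-Leman refinement: each round, a node's new
    color encodes its own color plus the multiset of neighbor colors."""
    for _ in range(rounds):
        signatures = [
            (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in range(len(adj))
        ]
        # Relabel signatures with compact integer colors.
        palette = {s: i for i, s in enumerate(sorted(set(signatures)))}
        colors = [palette[s] for s in signatures]
    return colors

# Path graph 0-1-2-3: endpoints become distinguishable from interior nodes.
adj = [[1], [0, 2], [1, 3], [2]]
colors = wl_refine(adj, [0, 0, 0, 0])
```

The injective (concatenation-style) signature is what lets refinement distinguish non-isomorphic neighborhoods that a plain sum would conflate.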
Machine learning models have made incredible progress, but they still struggle when applied to examples from unseen domains. This study focuses on a specific problem of domain generalization, where a model is trained on one source domain and tested on multiple target domains that are unseen during training. We propose IMO: Invariant features Masks for Out-of-Distribution text classification, to achieve OOD generalization by learning invariant features. During training, IMO learns sparse mask layers to remove features that are irrelevant for prediction, while the remaining features are kept invariant. Additionally, IMO has an attention module at the token level to focus on tokens that are useful for prediction. Our comprehensive experiments show that IMO substantially outperforms strong baselines in terms of various evaluation metrics and settings.
https://arxiv.org/abs/2404.13504
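A sparse mask layer of the kind IMO learns can be sketched as a thresholded feature gate plus a sparsity penalty; the logit values, threshold, and penalty below are illustrative assumptions rather than IMO's trained parameters:

```python
import numpy as np

def sparse_mask_forward(x, mask_logits, threshold=0.0):
    """Apply a learned sparse mask: features whose logit falls below
    the threshold are zeroed out as irrelevant to prediction."""
    mask = (mask_logits > threshold).astype(x.dtype)
    return x * mask, mask

def l1_mask_penalty(mask_logits):
    """L1 penalty on sigmoid(logits) pushes the mask toward sparsity."""
    return float(np.mean(1.0 / (1.0 + np.exp(-mask_logits))))

x = np.array([[1.0, -2.0, 3.0, 0.5]])
mask_logits = np.array([2.0, -1.0, 1.5, -3.0])   # hypothetical learned values
masked, mask = sparse_mask_forward(x, mask_logits)
penalty = l1_mask_penalty(mask_logits)
```

Training trades prediction loss against the penalty, so only features that consistently help prediction keep their mask open across domains.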
The deep learning revolution has strongly impacted low-level image processing tasks such as style/domain transfer, enhancement/restoration, and visual quality assessments. Despite often being treated separately, the aforementioned tasks share a common theme of understanding, editing, or enhancing the appearance of input images without modifying the underlying content. We leverage this observation to develop a novel disentangled representation learning method that decomposes inputs into content and appearance features. The model is trained in a self-supervised manner and we use the learned features to develop a new quality prediction model named DisQUE. We demonstrate through extensive evaluations that DisQUE achieves state-of-the-art accuracy across quality prediction tasks and distortion types. Moreover, we demonstrate that the same features may also be used for image processing tasks such as HDR tone mapping, where the desired output characteristics may be tuned using example input-output pairs.
https://arxiv.org/abs/2404.13484
Ontology matching is defined as finding a relationship or correspondence between two or more entities in two or more ontologies. To solve the interoperability problem of domain ontologies, semantically similar entities in these ontologies must be found and aligned before merging them. GraphMatcher, developed in this study, is an ontology matching system that uses a graph attention approach to compute a higher-level representation of a class together with its surrounding terms. GraphMatcher obtained remarkable results in the Ontology Alignment Evaluation Initiative (OAEI) 2022 conference track. Its code is available at ~\url{this https URL}.
https://arxiv.org/abs/2404.14450
Decoding visual information from human brain activity has seen remarkable advancements in recent research. However, due to the significant variability in cortical parcellation and cognition patterns across subjects, current approaches build personalized deep models for each subject, constraining the practicality of this technology in real-world contexts. To tackle these challenges, we introduce Wills Aligner, a robust multi-subject brain representation learner. Wills Aligner initially aligns different subjects' brains at the anatomical level. Subsequently, it incorporates a mixture of brain experts to learn individual cognition patterns. Additionally, it decouples the multi-subject learning task into two training stages, propelling the deep model and its plugin network to learn inter-subject commonality knowledge and various cognition patterns, respectively. Wills Aligner enables us to overcome anatomical differences and to efficiently leverage a single model for multi-subject brain representation learning. We meticulously evaluate the performance of our approach across coarse-grained and fine-grained visual decoding tasks. The experimental results demonstrate that Wills Aligner achieves state-of-the-art performance.
https://arxiv.org/abs/2404.13282
Representation learning from Gigapixel Whole Slide Images (WSI) poses a significant challenge in computational pathology due to the complicated nature of tissue structures and the scarcity of labeled data. Multiple-instance learning (MIL) methods address this challenge by leveraging image patches to classify slides, building on feature encoders pretrained with Self-Supervised Learning (SSL) approaches. The performance of both SSL and MIL methods relies on the architecture of the feature encoder. This paper proposes leveraging the Vision Mamba (Vim) architecture, inspired by state space models, within the DINO framework for representation learning in computational pathology. We evaluate the performance of Vim against Vision Transformers (ViT) on the Camelyon16 dataset for both patch-level and slide-level classification. Our findings highlight Vim's enhanced performance compared to ViT, particularly at smaller scales, where Vim achieves an 8.21 increase in ROC AUC for models of similar size. An explainability analysis further highlights Vim's capabilities, revealing that Vim, unlike ViT, uniquely emulates the pathologist workflow. This alignment with human expert analysis highlights Vim's potential in practical diagnostic settings and contributes significantly to developing effective representation-learning algorithms in computational pathology. We release the code and pretrained weights at \url{this https URL}.
https://arxiv.org/abs/2404.13222
Sequential recommendation is dedicated to offering items of interest to users based on their historical behaviors. The attribute-opinion pairs, expressed by users in their reviews of items, provide the potential to capture user preferences and item characteristics at a fine-grained level. To this end, we propose a novel framework, FineRec, that explores the attribute-opinion pairs of reviews to handle sequential recommendation at a fine granularity. Specifically, we utilize a large language model to extract attribute-opinion pairs from reviews. For each attribute, a unique attribute-specific user-opinion-item graph is created, where the corresponding opinions serve as the edges linking heterogeneous user and item nodes. To tackle the diversity of opinions, we devise a diversity-aware convolution operation to aggregate information within the graphs, enabling attribute-specific user and item representation learning. Ultimately, we present an interaction-driven fusion mechanism to integrate attribute-specific user/item representations across all attributes for generating recommendations. Extensive experiments conducted on several real-world datasets demonstrate the superiority of our FineRec over existing state-of-the-art methods. Further analysis also verifies the effectiveness of our fine-grained manner of handling the task.
https://arxiv.org/abs/2404.12975
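The attribute-specific user-opinion-item graphs described above can be sketched as grouping extracted (user, item, attribute, opinion) tuples into one bipartite edge list per attribute, with opinions as edge labels. The sample tuples and helper name are hypothetical; FineRec extracts the pairs with a large language model:

```python
from collections import defaultdict

def build_attribute_graphs(reviews):
    """Group (user, item, attribute, opinion) tuples into one bipartite
    user-item graph per attribute, with opinions as edge labels."""
    graphs = defaultdict(list)
    for user, item, attribute, opinion in reviews:
        graphs[attribute].append((f"user:{user}", f"item:{item}", opinion))
    return dict(graphs)

# Hypothetical attribute-opinion pairs extracted from review text.
reviews = [
    ("u1", "phone_a", "battery", "long-lasting"),
    ("u2", "phone_a", "battery", "weak"),
    ("u1", "phone_b", "screen", "bright"),
]
graphs = build_attribute_graphs(reviews)
```

Each per-attribute edge list is then what the diversity-aware convolution aggregates over to produce attribute-specific representations.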
We present a novel method to generate human motion to populate 3D indoor scenes. It can be controlled with various combinations of conditioning signals such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds. State-of-the-art methods are either specialized to a single setting, require vast amounts of high-quality and diverse training data, or are unconditional models that do not integrate scene or other contextual information. As a consequence, they have limited applicability and rely on costly training data. To address these limitations, we propose a new method, dubbed Purposer, based on neural discrete representation learning. Our model is capable of exploiting, in a flexible manner, different types of information already present in open-access large-scale datasets such as AMASS. First, we encode unconditional human motion into a discrete latent space. Second, an autoregressive generative model, conditioned on key contextual information via prompting or additive tokens and trained for next-step prediction in this space, synthesizes sequences of latent indices. We further design a novel conditioning block to handle future conditioning information in such a causal model by using a network with two branches to compute separate stacks of features. In this manner, Purposer can generate realistic motion sequences in diverse test scenes. Through exhaustive evaluation, we demonstrate that our multi-contextual solution outperforms existing approaches specialized to specific contextual information, both in terms of quality and diversity. Our model is trained on short sequences, but a byproduct of being able to use various conditioning signals is that, at test time, different combinations can be used to chain short sequences together and generate long motions within a context scene.
https://arxiv.org/abs/2404.12942
Current point cloud semantic segmentation has achieved great advances when given sufficient labels. However, the dense annotation of LiDAR point clouds remains prohibitively expensive and time-consuming, unable to keep up with the continuously growing volume of data. In this paper, we propose annotating images with scattered points, followed by utilizing SAM (a foundation model) to generate semantic segmentation labels for the images. Finally, by mapping the segmentation labels of the images to the LiDAR space using the intrinsic and extrinsic parameters of the camera and LiDAR, we obtain labels for point cloud semantic segmentation, and release Scatter-KITTI and Scatter-nuScenes, the first works to utilize image-segmentation-based SAM for weakly supervised point cloud semantic segmentation. Furthermore, to mitigate the influence of erroneous pseudo-labels obtained from sparse annotations on point cloud features, we propose a multi-modal weakly supervised network for LiDAR semantic segmentation, called MM-ScatterNet. This network combines features from both the point cloud and image modalities, enhancing the representation learning of point clouds by introducing consistency constraints between multi-modal features and point cloud features. On the SemanticKITTI dataset, we achieve 66% of fully supervised performance using only 0.02% of the annotated data, and on the NuScenes dataset, we achieve 95% of fully supervised performance using only 0.1% of labeled points.
https://arxiv.org/abs/2404.12861
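The image-to-LiDAR label transfer above hinges on the standard pinhole projection: transform LiDAR points into the camera frame with the extrinsics, then into pixel coordinates with the intrinsics, and read the image label at each projected pixel. A minimal sketch with identity extrinsics and an illustrative intrinsic matrix (not the KITTI/nuScenes calibration):

```python
import numpy as np

def project_points(points_lidar, T_cam_lidar, K):
    """Map LiDAR points to pixel coordinates via extrinsics then intrinsics."""
    homo = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    cam = (T_cam_lidar @ homo.T).T[:, :3]     # LiDAR frame -> camera frame
    in_front = cam[:, 2] > 0                  # keep points ahead of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]               # perspective divide
    return uv, in_front

# Identity extrinsics and a simple pinhole intrinsic matrix (illustrative).
T = np.eye(4)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 10.0],    # straight ahead -> principal point
                [1.0, 0.0, 10.0]])
uv, valid = project_points(pts, T, K)
```

Each valid point then inherits the SAM-generated segmentation label at its pixel, yielding weak point-cloud labels.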
In this paper, we propose a new Multimodal Representation Learning (MRL) method for Multimodal Sentiment Analysis (MSA), which facilitates adaptive interaction between modalities through Cooperative Sentiment Agents, named Co-SA. Co-SA comprises two critical components: the Sentiment Agents Establishment (SAE) phase and the Sentiment Agents Cooperation (SAC) phase. During the SAE phase, each sentiment agent deals with a unimodal signal and highlights explicit dynamic sentiment variations within the modality via the Modality-Sentiment Disentanglement (MSD) and Deep Phase Space Reconstruction (DPSR) modules. Subsequently, in the SAC phase, Co-SA meticulously designs task-specific interaction mechanisms for the sentiment agents so as to coordinate multimodal signals and learn the joint representation. Specifically, Co-SA equips each sentiment agent with an independent policy model that captures significant properties within its modality. These policies are optimized jointly through a unified reward adapted to downstream tasks. Benefiting from the reward mechanism, Co-SA transcends the limitation of pre-defined fusion modes and adaptively captures unimodal properties for MRL in the multimodal interaction setting. To demonstrate the effectiveness of Co-SA, we apply it to Multimodal Sentiment Analysis (MSA) and Multimodal Emotion Recognition (MER) tasks. Our comprehensive experimental results demonstrate that Co-SA excels at discovering diverse cross-modal features, encompassing both common and complementary aspects. The code is available at this https URL.
https://arxiv.org/abs/2404.12642
Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. However, despite this promising potential, conventional SSL methods encounter limitations, including challenges in achieving semantic alignment and capturing subtle details. This leads to suboptimal representations that fail to accurately capture the underlying anatomical structures and pathological details. In response to these constraints, we introduce OPTiML, a novel SSL framework employing optimal transport (OT) to capture dense semantic invariance and fine-grained details, thereby enhancing the overall effectiveness of SSL in medical image representation learning. The core idea is to integrate OT with a cross-viewpoint semantics infusion module (CV-SIM), which effectively captures the complex, fine-grained details inherent in medical images across different viewpoints. In addition to the CV-SIM module, OPTiML imposes variance and covariance regularizations within the OT framework to force the model to focus on clinically relevant information while discarding less informative features. Through these, the proposed framework demonstrates its capacity to learn semantically rich representations that can be applied to various medical imaging tasks. To validate its effectiveness, we conduct experimental studies on three publicly available datasets from the chest X-ray modality. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.
https://arxiv.org/abs/2404.11868
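Optimal transport between two sets of features is commonly computed with entropy-regularized Sinkhorn iterations, which alternately rescale a kernel matrix until the transport plan matches both marginals. A minimal sketch of that standard algorithm (a generic illustration, not OPTiML's full objective with its variance/covariance regularizers):

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.1, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations:
    alternately rescale the kernel to match the row/column marginals."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

cost = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
a = np.array([0.5, 0.5])   # source marginal
b = np.array([0.5, 0.5])   # target marginal
plan = sinkhorn(cost, a, b)
```

Low-cost pairings receive almost all of the mass, which is what lets an OT loss densely align semantically corresponding regions across views.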
In this paper, we present a novel approach termed Prompt-Driven Feature Diffusion (PDFD) within a semi-supervised learning framework for Open World Semi-Supervised Learning (OW-SSL). At its core, PDFD deploys an efficient feature-level diffusion model with the guidance of class-specific prompts to support discriminative feature representation learning and feature generation, tackling the challenge of the non-availability of labeled data for unseen classes in OW-SSL. In particular, PDFD utilizes class prototypes as prompts in the diffusion model, leveraging their class-discriminative and semantic generalization ability to condition and guide the diffusion process across all the seen and unseen classes. Furthermore, PDFD incorporates a class-conditional adversarial loss for diffusion model training, ensuring that the features generated via the diffusion process can be discriminatively aligned with the class-conditional features of the real data. Additionally, the class prototypes of the unseen classes are computed using only unlabeled instances with confident predictions within a semi-supervised learning framework. We conduct extensive experiments to evaluate the proposed PDFD. The empirical results show PDFD exhibits remarkable performance enhancements over many state-of-the-art existing methods.
https://arxiv.org/abs/2404.11795
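The prototype computation for unseen classes described above, using only confidently predicted unlabeled instances, can be sketched as follows (threshold and toy data are illustrative assumptions):

```python
import numpy as np

def confident_prototypes(features, probs, threshold=0.9):
    """Average the features of unlabeled samples whose maximum predicted
    probability exceeds the threshold, grouped by pseudo-label."""
    preds = probs.argmax(1)
    conf = probs.max(1) >= threshold
    protos = {}
    for c in np.unique(preds[conf]):
        protos[int(c)] = features[conf & (preds == c)].mean(0)
    return protos

features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
probs = np.array([[0.95, 0.05], [0.97, 0.03], [0.05, 0.95], [0.6, 0.4]])
protos = confident_prototypes(features, probs)
# The low-confidence sample [0.5, 0.5] contributes to no prototype.
```

These per-class prototypes are then what PDFD feeds to the diffusion model as class-specific prompts.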
This paper focuses on reducing the communication cost of federated learning by exploring generalization bounds and representation learning. We first characterize a tighter generalization bound for one-round federated learning based on the local clients' generalization and the heterogeneity of the data distribution (non-iid scenario). We also characterize a generalization bound for R-round federated learning and its relation to the number of local updates (local stochastic gradient descent (SGD) steps). Then, based on our generalization bound analysis and our representation learning interpretation of this analysis, we show for the first time that less frequent aggregation, and hence more local updates, for the representation extractor (which usually corresponds to the initial layers) leads to more generalizable models, particularly in non-iid scenarios. We design a novel Federated Learning with Adaptive Local Steps (FedALS) algorithm based on our generalization bound and representation learning analysis. FedALS employs different aggregation frequencies for different parts of the model, thereby reducing the communication cost. We conclude with experimental results showing the effectiveness of FedALS.
https://arxiv.org/abs/2404.11754
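The core FedALS mechanism, aggregating the representation extractor less often than the head, can be simulated in a few lines. The two-part model, period values, and update flags are a simplified illustration, not the paper's algorithm in full:

```python
import numpy as np

def fedals_round(client_models, step, encoder_period=4, head_period=1):
    """Average the encoder across clients only every `encoder_period` steps
    and the head every `head_period` steps; otherwise keep local copies."""
    flags = []
    for part, period in (("encoder", encoder_period), ("head", head_period)):
        if step % period == 0:
            mean = np.mean([m[part] for m in client_models], axis=0)
            for m in client_models:
                m[part] = mean.copy()
        flags.append(step % period == 0)
    return flags  # [encoder_aggregated, head_aggregated]

clients = [{"encoder": np.array([1.0]), "head": np.array([0.0])},
           {"encoder": np.array([3.0]), "head": np.array([2.0])}]
flags = fedals_round(clients, step=1)   # head averaged, encoder left local
```

Communicating the (typically large) encoder only every few rounds is where the bandwidth saving comes from, while the analysis above argues the extra local encoder steps also help generalization under non-iid data.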
Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information about the geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework that considers spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flip, rotation, and scene flow. For spatial augmentations, we find that, depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields the best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We apply our pre-training method to 3D object detection and show that it outperforms existing equivariant and invariant approaches in many settings.
https://arxiv.org/abs/2404.11737
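An equivariance pre-training loss of the kind described above penalizes the gap between encoding a transformed input and transforming the encoding, i.e. ||f(T(x)) - T(f(x))||². A toy 2D-rotation version with linear stand-in encoders (the paper's networks operate on LiDAR point clouds; these names and encoders are illustrative):

```python
import numpy as np

def rotation_2d(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def equivariance_loss(encode, points, theta):
    """Penalize the gap between encoding a rotated input and rotating
    the encoding: || f(R x) - R f(x) ||^2."""
    R = rotation_2d(theta)
    lhs = encode(points @ R.T)     # transform the input, then encode
    rhs = encode(points) @ R.T     # encode, then transform the features
    return float(((lhs - rhs) ** 2).mean())

pts = np.random.default_rng(1).normal(size=(16, 2))
linear = lambda x: x * 2.0                          # equivariant toy encoder
offset = lambda x: x * 2.0 + np.array([1.0, 0.0])   # breaks equivariance
loss_eq = equivariance_loss(linear, pts, np.pi / 4)
loss_ne = equivariance_loss(offset, pts, np.pi / 4)
```

Unlike an invariance loss, this objective keeps the geometric relationship between the two views in the features rather than discarding it.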
Change detection aims to identify remote sensing object changes by analyzing data between bitemporal image pairs. Because of the large temporal and spatial span of data collection in change detection image pairs, there is often a significant amount of both task-specific and task-agnostic noise. Previous efforts have focused excessively on denoising, at the cost of losing a great deal of fine-grained information. In this paper, we revisit the importance of fine-grained features in change detection and propose a series of operations for fine-grained information compensation and noise decoupling (FINO). First, the context is utilized to compensate for fine-grained information in the feature space. Next, a shape-aware and a brightness-aware module are designed to improve the capacity for representation learning. The shape-aware module guides the backbone network toward more precise shape estimation and object shape feature extraction. The brightness-aware module learns an overall brightness estimation to improve the model's robustness to task-agnostic noise. Finally, a task-specific noise decoupling structure is designed to improve the model's ability to separate noise interference from feature similarity. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results on multiple change detection benchmarks. The code will be made available.
https://arxiv.org/abs/2404.11318