In recent studies on MRI reconstruction, advances have shown significant promise for further accelerating MRI acquisition. Most state-of-the-art methods require a large amount of fully-sampled data to optimise reconstruction models, which is impractical and expensive in certain clinical settings. On the other hand, unsupervised scan-specific reconstruction methods are prone to overfitting due to insufficient supervision, while restrictions on acceleration rates and under-sampling patterns further limit their applicability. To this end, we propose an unsupervised, adaptive coarse-to-fine framework that enhances reconstruction quality without being constrained by the sparsity levels or patterns in under-sampling. The framework employs an implicit neural representation for scan-specific MRI reconstruction, learning a mapping from multi-dimensional coordinates to their corresponding signal intensities. Moreover, we integrate a novel learning strategy that progressively refines the use of acquired k-space signals for self-supervision. This approach effectively adjusts the proportion of supervising signals from unevenly distributed information across different frequency bands, thus mitigating the issue of overfitting while improving the overall reconstruction. Comprehensive evaluation on a public dataset, including both 2D and 3D data, has shown that our method outperforms current state-of-the-art scan-specific MRI reconstruction techniques, for up to 8-fold under-sampling.
https://arxiv.org/abs/2312.00677
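To make the implicit-neural-representation idea above concrete, the sketch below (a minimal, hypothetical example, not the authors' code) fits a small coordinate MLP so that its predicted image agrees with the acquired k-space samples only at the sampled locations; the Fourier-feature encoding, network size, and 4x mask are all assumptions.

```python
# Minimal sketch: scan-specific implicit neural representation (assumed design).
# A coordinate MLP predicts complex image intensity; supervision comes only from
# the acquired (under-sampled) k-space locations via an FFT-based data term.
import torch
import torch.nn as nn

class CoordinateMLP(nn.Module):
    def __init__(self, in_dim=2, hidden=256, n_freqs=8):
        super().__init__()
        self.n_freqs = n_freqs
        enc_dim = in_dim * 2 * n_freqs            # sin/cos Fourier features
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                 # real + imaginary intensity
        )

    def encode(self, coords):                     # coords in [-1, 1], shape (N, 2)
        freqs = 2.0 ** torch.arange(self.n_freqs, device=coords.device)
        ang = coords[..., None] * freqs * torch.pi          # (N, 2, n_freqs)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)

    def forward(self, coords):
        out = self.net(self.encode(coords))
        return torch.complex(out[..., 0], out[..., 1])

H = W = 64
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)

mask = torch.rand(H, W) < 0.25                    # hypothetical ~4x under-sampling mask
kspace = torch.randn(H, W, dtype=torch.complex64) # stands in for acquired k-space

model = CoordinateMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):                               # a few illustrative steps
    img = model(coords).reshape(H, W)
    pred_k = torch.fft.fftshift(torch.fft.fft2(img))
    loss = (pred_k[mask] - kspace[mask]).abs().pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```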
Unsupervised relation extraction (URE) aims to extract relations between named entities from raw text without requiring manual annotations or pre-existing knowledge bases. Recent studies of URE place a notable emphasis on contrastive learning strategies for acquiring relation representations. However, these studies often overlook two important aspects: the inclusion of diverse positive pairs for contrastive learning and the exploration of appropriate loss functions. In this paper, we propose AugURE, which combines within-sentence pair augmentation and augmentation through cross-sentence pair extraction to increase the diversity of positive pairs and strengthen the discriminative power of contrastive learning. We also identify the limitation of the noise-contrastive estimation (NCE) loss for relation representation learning and propose to apply a margin loss to sentence pairs. Experiments on the NYT-FB and TACRED datasets demonstrate that the proposed relation representation learning combined with simple K-Means clustering achieves state-of-the-art performance.
https://arxiv.org/abs/2312.00552
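The margin loss the abstract argues for over NCE can be illustrated with a small sketch; the cosine-similarity formulation and the margin value below are assumptions, not AugURE's exact objective.

```python
# Minimal sketch of a margin loss over sentence-pair relation representations
# (assumed form, in the spirit of replacing NCE for positive/negative pairs).
import torch
import torch.nn.functional as F

def margin_pair_loss(anchor, positive, negative, margin=0.5):
    """Push positive pairs closer than negative pairs by at least `margin`
    in cosine-similarity space. Shapes: (batch, dim)."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

# Hypothetical relation representations produced by an encoder.
anchor, positive, negative = (torch.randn(32, 768) for _ in range(3))
loss = margin_pair_loss(anchor, positive, negative)
```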
Objective: Despite the recent increase in research activity, deep-learning models have not yet been widely accepted in medicine. The shortage of high-quality annotated data often hinders the development of robust and generalizable models that do not suffer from degraded effectiveness when presented with newly collected, out-of-distribution (OOD) datasets. Methods: Contrastive Self-Supervised Learning (SSL) offers a potential solution to the scarcity of labeled data, as it takes advantage of unlabeled data to increase model effectiveness and robustness. In this research, we propose applying contrastive SSL for detecting abnormalities in phonocardiogram (PCG) samples by learning a generalized representation of the signal. Specifically, we perform an extensive comparative evaluation of a wide range of audio-based augmentations and evaluate trained classifiers on multiple datasets across different downstream tasks. Results: We experimentally demonstrate that, depending on its training distribution, the effectiveness of a fully-supervised model can degrade by up to 32% when evaluated on unseen data, while SSL models lose at most 10% or even improve in some cases. Conclusions: Contrastive SSL pretraining can provide robust classifiers that generalize to unseen, OOD data without relying on time- and labor-intensive annotation by medical experts. Furthermore, the proposed extensive evaluation protocol sheds light on the most promising and appropriate augmentations for robust PCG signal processing. Significance: We provide researchers and practitioners with a roadmap towards producing robust models for PCG classification, in addition to an open-source codebase for developing novel approaches.
https://arxiv.org/abs/2312.00502
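As an illustration of contrastive SSL on PCG signals, the sketch below builds a positive pair from two stochastic augmentations of the same waveform and applies a SimCLR-style NT-Xent loss; the toy augmentation, encoder, and clip length are placeholders rather than the augmentations evaluated in the paper.

```python
# Minimal sketch: two stochastic audio augmentations of a PCG waveform form a
# positive pair for contrastive (SimCLR-style) pretraining. All specifics assumed.
import torch
import torch.nn.functional as F

def augment(wave):
    """Toy augmentation: random gain plus additive Gaussian noise."""
    gain = 0.8 + 0.4 * torch.rand(1)
    return gain * wave + 0.01 * torch.randn_like(wave)

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent contrastive loss between two batches of embeddings."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)           # (2B, D)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))                       # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4000, 128))
batch = torch.randn(16, 1, 4000)                            # hypothetical 1-s PCG clips
z1, z2 = encoder(augment(batch)), encoder(augment(batch))
loss = nt_xent(z1, z2)
```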
Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the deficiency of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask partial signals of one channel and ask the model to reconstruct them, which makes it possible to learn spatial acoustic information from unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to be able to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results of five acoustic parameter estimation tasks on both simulated and real-world data show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.
https://arxiv.org/abs/2312.00476
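The cross-channel signal reconstruction (CCSR) pretext task can be sketched roughly as follows: mask a time span of one channel and reconstruct it from the unmasked frames plus the other channel, computing the loss only on masked frames. The tiny GRU model and STFT-like input shapes are assumptions, not the paper's encoder-decoder.

```python
# Minimal sketch of the cross-channel signal reconstruction (CCSR) pretext idea:
# mask a span of channel 0 and ask a model to reconstruct it, conditioned on the
# unmasked parts of channel 0 and the full channel 1. Architecture details assumed.
import torch
import torch.nn as nn

class TinyCCSR(nn.Module):
    def __init__(self, n_fft_bins=257, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(2 * n_fft_bins, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_fft_bins)

    def forward(self, ch0_masked, ch1):
        x = torch.cat([ch0_masked, ch1], dim=-1)     # (B, T, 2*F)
        h, _ = self.encoder(x)
        return self.decoder(h)                       # reconstruct channel 0

B, T, Fbins = 4, 100, 257
ch0 = torch.randn(B, T, Fbins)                       # magnitude STFT, channel 0
ch1 = torch.randn(B, T, Fbins)                       # channel 1
mask = torch.zeros(B, T, 1); mask[:, 40:60] = 1.0    # mask a time span

model = TinyCCSR()
recon = model(ch0 * (1 - mask), ch1)
loss = ((recon - ch0) * mask).pow(2).mean()          # loss only on masked frames
```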
Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of the source code features while preserving its semantics. These representations can be used for facilitating subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of the AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with AST-based code representation consistently perform worse across all three tasks compared to models trained with Token-based code representation. Our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation in certain subsets of samples across all three tasks. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.
https://arxiv.org/abs/2312.00413
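For readers unfamiliar with the two input forms being compared, the snippet below shows how the same Python function yields a flat token sequence and an AST using the standard library; it only illustrates the representations, not the paper's training pipeline.

```python
# Minimal illustration of the two input forms being compared: a flat token
# sequence versus an abstract syntax tree, using Python's standard library.
import ast
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"

# Token-based view: a linear sequence of lexical tokens.
tokens = [tok.string for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]
print(tokens)      # ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']

# AST-based view: a tree of syntactic node types (linearized here with ast.walk,
# which visits nodes breadth-first).
tree = ast.parse(source)
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)  # ['Module', 'FunctionDef', 'arguments', 'Return', ...]
```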
Visual-language pre-training (VLP) has achieved remarkable success in multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that multi-modal large language models (MLLMs) can enhance visual-language representation learning by improving data quality. Our approach is simple: we use MLLMs to extend multiple captions for each image. To prevent the bias introduced by MLLMs' hallucinations and intrinsic caption styles, we propose a "text shearing" operation to keep the lengths of extended captions identical to the originals. In image-text retrieval, our method consistently obtains 5.6 ~ 35.0% and 16.8 ~ 46.1% improvements in R@1 under the fine-tuning and zero-shot settings, respectively. Notably, our zero-shot results are comparable to fine-tuning on the target datasets, which encourages further exploration of the versatile use of MLLMs.
https://arxiv.org/abs/2311.18765
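The "text shearing" idea, as described, amounts to clipping each MLLM-extended caption back to the original caption's length. A minimal sketch follows; whether length is measured in words or tokens is an assumption here.

```python
# Minimal sketch of "text shearing" as described: clip an MLLM-extended caption so
# its length matches the original caption. Whether length is counted in words or
# tokens is an assumption; a word count is used for illustration.
def text_shear(original_caption: str, extended_caption: str) -> str:
    max_len = len(original_caption.split())
    return " ".join(extended_caption.split()[:max_len])

original = "a dog running on the beach"
extended = ("a small brown dog running happily on a sunny sandy beach next to "
            "gentle ocean waves at sunset")
print(text_shear(original, extended))   # "a small brown dog running happily"
```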
Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may - and to some extent already does - result in advantages regarding the representation's structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods are expected to surpass their supervised counterparts due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards an objective originating from specific annotation-based signals. While major advantages of unsupervised representation learning have been recently observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives: (i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that utilize self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. (ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring and investigating the objective function mismatch between various unsupervised pretext tasks and target tasks. (iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection, and a method based on prototypical self-supervised learning. Finally, we contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies.
https://arxiv.org/abs/2312.00101
Part-aware panoptic segmentation is a computer vision problem that aims to provide a semantic understanding of the scene at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present our Joint Panoptic Part Fusion (JPPF), which combines the three individual segmentations effectively to obtain a panoptic-part segmentation. Two aspects are of utmost importance for this: first, a unified model for the three problems is desired that allows for mutually improved and consistent representation learning; second, the combination must be balanced so that all individual results receive equal importance during fusion. Our proposed JPPF is parameter-free and dynamically balances its input. The method is evaluated and compared on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets in terms of PartPQ and Part-Whole Quality (PWQ). In extensive experiments, we verify the importance of our fair fusion, highlight its most significant impact for areas that can be further segmented into parts, and demonstrate the generalization capabilities of our design without fine-tuning on 5 additional datasets.
https://arxiv.org/abs/2311.18618
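A parameter-free fusion that gives the three branches equal importance can be sketched, in its simplest form, as averaging their per-pixel class probabilities; this is an assumed simplification of JPPF's dynamically balanced fusion.

```python
# Minimal sketch of a parameter-free fusion of three per-pixel probability maps
# (semantic, instance, part). Averaging in probability space gives each branch
# equal weight; an assumed simplification of the paper's balanced fusion.
import torch
import torch.nn.functional as F

H, W, C = 8, 8, 5                                   # tiny toy resolution, 5 classes
semantic_logits = torch.randn(C, H, W)
instance_logits = torch.randn(C, H, W)
part_logits = torch.randn(C, H, W)

probs = torch.stack([F.softmax(x, dim=0) for x in
                     (semantic_logits, instance_logits, part_logits)])
fused = probs.mean(dim=0)                           # equal importance to each branch
prediction = fused.argmax(dim=0)                    # (H, W) fused label map
```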
Modeling dynamic, large-scale urban scenes is challenging due to their highly intricate geometric structures and unconstrained dynamics in both space and time. Prior methods often employ high-level architectural priors, separating static and dynamic elements, resulting in suboptimal capture of their synergistic interactions. To address this challenge, we present a unified representation model, called Periodic Vibration Gaussian (PVG). PVG builds upon the efficient 3D Gaussian splatting technique, originally designed for static scene representation, by introducing periodic vibration-based temporal dynamics. This innovation enables PVG to elegantly and uniformly represent the characteristics of various objects and elements in dynamic urban scenes. To enhance temporally coherent representation learning with sparse training data, we introduce a novel flow-based temporal smoothing mechanism and a position-aware adaptive control strategy. Extensive experiments on Waymo Open Dataset and KITTI benchmarks demonstrate that PVG surpasses state-of-the-art alternatives in both reconstruction and novel view synthesis for both dynamic and static scenes. Notably, PVG achieves this without relying on manually labeled object bounding boxes or expensive optical flow estimation. Moreover, PVG exhibits 50/6000-fold acceleration in training/rendering over the best alternative.
https://arxiv.org/abs/2311.18561
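One way to read "periodic vibration-based temporal dynamics" is a Gaussian whose center oscillates sinusoidally around an anchor position; the sketch below illustrates that reading, with a parameterization assumed for illustration rather than taken from the paper.

```python
# Minimal sketch of a "periodic vibration" temporal dynamic for a Gaussian center:
# the mean oscillates sinusoidally around an anchor position. The exact
# parameterization in PVG is assumed here, not copied from the paper.
import numpy as np

def vibrating_center(anchor, amplitude, period, phase, t):
    """Position of a Gaussian center at time t under periodic vibration."""
    return anchor + amplitude * np.sin(2.0 * np.pi * t / period + phase)

anchor = np.array([1.0, 2.0, 0.5])        # static 3D position
amplitude = np.array([0.1, 0.0, 0.05])    # per-axis vibration amplitude
centers = np.stack([vibrating_center(anchor, amplitude, period=2.0, phase=0.0, t=t)
                    for t in np.linspace(0.0, 2.0, 5)])
print(centers.shape)                      # (5, 3): center trajectory over time
```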
Event cameras have emerged as a promising vision sensor in recent years due to their unparalleled temporal resolution and dynamic range. While registration of 2D RGB images to 3D point clouds is a long-standing problem in computer vision, no prior work studies 2D-3D registration for event cameras. To this end, we propose E2PNet, the first learning-based method for event-to-point cloud registration. The core of E2PNet is a novel feature representation network called Event-Points-to-Tensor (EP2T), which encodes event data into a 2D grid-shaped feature tensor. This grid-shaped feature enables mature RGB-based frameworks to be easily used for event-to-point cloud registration, without changing hyper-parameters and the training procedure. EP2T treats the event input as spatio-temporal point clouds. Unlike standard 3D learning architectures that treat all dimensions of point clouds equally, the novel sampling and information aggregation modules in EP2T are designed to handle the inhomogeneity of the spatial and temporal dimensions. Experiments on the MVSEC and VECtor datasets demonstrate the superiority of E2PNet over hand-crafted and other learning-based methods. Compared to RGB-based registration, E2PNet is more robust to extreme illumination or fast motion due to the use of event data. Beyond 2D-3D registration, we also show the potential of EP2T for other vision tasks such as flow estimation, event-to-image reconstruction and object recognition. The source code can be found at: this https URL.
https://arxiv.org/abs/2311.18433
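The grid-shaped feature tensor EP2T produces can be pictured with a fixed, hand-crafted stand-in: accumulate per-pixel event statistics into an image-like tensor. EP2T itself learns this encoding with sampling and aggregation modules; the binning below only shows the target tensor shape.

```python
# Minimal sketch of turning an event stream (x, y, t, polarity) into a 2D
# grid-shaped tensor by accumulating per-pixel statistics. This fixed binning is
# only a stand-in for EP2T's learned encoding.
import torch

def events_to_grid(events, height, width):
    """events: (N, 4) rows of [x, y, t, polarity] -> (3, H, W) tensor holding
    positive counts, negative counts, and mean timestamp per pixel."""
    grid = torch.zeros(3, height, width)
    x, y = events[:, 0].long(), events[:, 1].long()
    t, p = events[:, 2], events[:, 3]
    idx = y * width + x
    grid[0].view(-1).index_add_(0, idx[p > 0], torch.ones_like(t[p > 0]))
    grid[1].view(-1).index_add_(0, idx[p <= 0], torch.ones_like(t[p <= 0]))
    grid[2].view(-1).index_add_(0, idx, t)
    counts = (grid[0] + grid[1]).clamp(min=1)
    grid[2] /= counts                      # mean timestamp per pixel
    return grid

events = torch.rand(1000, 4) * torch.tensor([240.0, 180.0, 1.0, 1.0])
events[:, 3] = torch.where(torch.rand(1000) > 0.5, 1.0, -1.0)
grid = events_to_grid(events, height=180, width=240)
```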
Neural radiance fields (NeRFs) have achieved impressive view synthesis results by learning an implicit volumetric representation from multi-view images. To project the implicit representation into an image, NeRF employs volume rendering that approximates the continuous integrals of rays as an accumulation of the colors and densities of the sampled points. Although this approximation enables efficient rendering, it ignores the direction information in point intervals, resulting in ambiguous features and limited reconstruction quality. In this paper, we propose an anisotropic neural representation learning method that utilizes learnable view-dependent features to improve scene representation and reconstruction. We model the volumetric function as spherical harmonic (SH)-guided anisotropic features, parameterized by multilayer perceptrons, facilitating ambiguity elimination while preserving the rendering efficiency. To achieve robust scene reconstruction without anisotropy overfitting, we regularize the energy of the anisotropic features during training. Our method is flexible and can be plugged into NeRF-based frameworks. Extensive experiments show that the proposed representation can boost the rendering quality of various NeRFs and achieve state-of-the-art rendering performance on both synthetic and real-world scenes.
https://arxiv.org/abs/2311.18311
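A rough sketch of SH-guided anisotropic features with an energy regularizer is given below: low-order real spherical harmonics weight per-point coefficients as a function of view direction, and the non-constant (anisotropic) coefficients are penalized. The SH order, feature size, and regularizer weight are assumptions.

```python
# Minimal sketch: view-dependent features from low-order real spherical harmonics,
# with an energy penalty on the anisotropic (non-constant) coefficients. The exact
# SH order and regularizer weight used in the paper are assumptions here.
import torch

def sh_basis_l1(dirs):
    """Real SH basis up to l=1 for unit directions, shape (N, 4)."""
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    c0 = torch.full_like(x, 0.28209479)            # l=0 constant term
    return torch.stack([c0, 0.48860251 * y, 0.48860251 * z, 0.48860251 * x], dim=-1)

N, feat_dim = 1024, 16
coeffs = torch.randn(N, 4, feat_dim, requires_grad=True)   # per-point SH coefficients
dirs = torch.nn.functional.normalize(torch.randn(N, 3), dim=-1)

basis = sh_basis_l1(dirs)                          # (N, 4)
features = torch.einsum("nb,nbf->nf", basis, coeffs)        # view-dependent features

render_loss = features.mean()                      # stands in for the photometric loss
energy_reg = coeffs[:, 1:].pow(2).mean()           # penalize anisotropic energy
loss = render_loss + 0.01 * energy_reg
loss.backward()
```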
The human visual recognition system shows an astonishing capability of compressing visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind this is perceptual grouping. Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates equally powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that entirely relies on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations are used to iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model can achieve competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, the Perceptual Group Tokenizer achieves 80.3% on the ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking new progress under this paradigm.
https://arxiv.org/abs/2311.18296
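The "grouping as tokenization" idea can be caricatured with a k-means-like iteration that soft-assigns pixel features to a small set of group tokens and re-averages them; the sketch below is only that caricature, not the Perceptual Group Tokenizer's actual modules.

```python
# Minimal sketch of an iterative grouping step: group tokens are refined by
# soft-assigning pixel features to them and re-averaging (a k-means-like update).
import torch
import torch.nn.functional as F

def grouping_iterations(pixel_feats, num_groups=8, iters=3, tau=0.07):
    """pixel_feats: (N, D) -> group tokens: (num_groups, D)."""
    groups = pixel_feats[torch.randperm(pixel_feats.size(0))[:num_groups]].clone()
    for _ in range(iters):
        sim = F.normalize(pixel_feats, dim=-1) @ F.normalize(groups, dim=-1).t()
        assign = F.softmax(sim / tau, dim=-1)            # (N, G) soft assignments
        groups = assign.t() @ pixel_feats                # weighted sum per group
        groups = groups / assign.sum(dim=0, keepdim=True).t().clamp(min=1e-6)
    return groups

pixel_feats = torch.randn(32 * 32, 64)                   # hypothetical pixel features
tokens = grouping_iterations(pixel_feats)                # (8, 64) group tokens
```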
While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations. We propose a novel attention mechanism for pooling feature maps and further leverage this mechanism in DifFormer, a transformer-based fusion of features from different diffusion U-Net blocks and noise steps. We also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods for discriminative tasks - image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation. Our project website (this https URL) and code (this https URL) are available publicly.
https://arxiv.org/abs/2311.17921
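A plausible form of attention pooling over a diffusion U-Net feature map is a single learned query attending over spatial positions; the sketch below shows that form, while the real DifFormer fuses multiple blocks and noise steps with details not specified here.

```python
# Minimal sketch of attention pooling over a diffusion U-Net feature map: a single
# learned query attends over the spatial positions to produce one pooled vector.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_map):                          # (B, C, H, W)
        B, C, H, W = feat_map.shape
        tokens = feat_map.flatten(2).transpose(1, 2)      # (B, H*W, C)
        pooled, _ = self.attn(self.query.expand(B, -1, -1), tokens, tokens)
        return pooled.squeeze(1)                          # (B, C)

pool = AttentionPool(dim=256)
unet_features = torch.randn(8, 256, 16, 16)               # hypothetical mid-block features
vec = pool(unet_features)                                  # (8, 256) pooled representation
```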
We introduce SODA, a self-supervised diffusion model, designed for representation learning. The model incorporates an image encoder, which distills a source view into a compact representation, that, in turn, guides the generation of related novel views. We show that by imposing a tight bottleneck between the encoder and a denoising decoder, and leveraging novel view synthesis as a self-supervised objective, we can turn diffusion models into strong representation learners, capable of capturing visual semantics in an unsupervised manner. To the best of our knowledge, SODA is the first diffusion model to succeed at ImageNet linear-probe classification, and, at the same time, it accomplishes reconstruction, editing and synthesis tasks across a wide range of datasets. Further investigation reveals the disentangled nature of its emergent latent space, that serves as an effective interface to control and manipulate the model's produced images. All in all, we aim to shed light on the exciting and promising potential of diffusion models, not only for image generation, but also for learning rich and robust representations.
https://arxiv.org/abs/2311.17901
Digital pathology has significantly advanced disease detection and pathologist efficiency through the analysis of gigapixel whole-slide images (WSI). In this process, WSIs are first divided into patches, for which a feature extractor model is applied to obtain feature vectors, which are subsequently processed by an aggregation model to predict the respective WSI label. With the rapid evolution of representation learning, numerous new feature extractor models, often termed foundational models, have emerged. Traditional evaluation methods, however, rely on fixed aggregation model hyperparameters, a framework we identify as potentially biasing the results. Our study uncovers a co-dependence between feature extractor models and aggregation model hyperparameters, indicating that performance comparability can be skewed based on the chosen hyperparameters. By accounting for this co-dependency, we find that the performance of many current feature extractor models is notably similar. We support this insight by evaluating seven feature extractor models across three different datasets with 162 different aggregation model configurations. This comprehensive approach provides a more nuanced understanding of the relationship between feature extractors and aggregation models, leading to a fairer and more accurate assessment of feature extractor models in digital pathology.
https://arxiv.org/abs/2311.17804
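The evaluation protocol the abstract argues for can be sketched as a sweep over aggregation-model hyperparameters per feature extractor, comparing extractors at their own best settings; all names, the grid, and the scoring function below are placeholders.

```python
# Minimal sketch of the evaluation protocol the abstract argues for: sweep the
# aggregation model's hyperparameters for every feature extractor instead of
# fixing them, then compare best-per-extractor scores. Random scores stand in
# for real training runs.
import itertools
import random

feature_extractors = ["extractor_A", "extractor_B", "extractor_C"]
hyper_grid = {"lr": [1e-4, 1e-3], "hidden_dim": [128, 256], "dropout": [0.0, 0.25]}

def train_and_score(extractor, config):
    """Placeholder for training an aggregation model on WSI features."""
    random.seed(hash((extractor, tuple(sorted(config.items())))) % 2**32)
    return random.uniform(0.7, 0.9)                 # fake AUC

best = {}
for extractor in feature_extractors:
    configs = [dict(zip(hyper_grid, values))
               for values in itertools.product(*hyper_grid.values())]
    scores = {tuple(c.items()): train_and_score(extractor, c) for c in configs}
    best[extractor] = max(scores.values())

print(best)   # compare extractors at their own best aggregation settings
```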
Self-supervised learning is an efficient pre-training method for medical image analysis. However, current research is mostly confined to specific-modality data pre-training, consuming considerable time and resources without achieving universality across different modalities. A straightforward solution is combining all modality data for joint self-supervised pre-training, which poses practical challenges. Firstly, our experiments reveal conflicts in representation learning as the number of modalities increases. Secondly, multi-modal data collected in advance cannot cover all real-world scenarios. In this paper, we reconsider versatile self-supervised learning from the perspective of continual learning and propose MedCoSS, a continuous self-supervised learning approach for multi-modal medical data. Unlike joint self-supervised learning, MedCoSS assigns different modality data to different training stages, forming a multi-stage pre-training process. To balance modal conflicts and prevent catastrophic forgetting, we propose a rehearsal-based continual learning method. We introduce the k-means sampling strategy to retain data from previous modalities and rehearse it when learning new modalities. Instead of executing the pretext task on buffer data, a feature distillation strategy and an intra-modal mixup strategy are applied to these data for knowledge retention. We conduct continuous self-supervised pre-training on a large-scale multi-modal unlabeled dataset, including clinical reports, X-rays, CT scans, MRI scans, and pathological images. Experimental results demonstrate MedCoSS's exceptional generalization ability across nine downstream datasets and its significant scalability in integrating new modality data. Code and pre-trained weights are available at this https URL.
https://arxiv.org/abs/2311.17597
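The k-means sampling strategy for the rehearsal buffer can be sketched as clustering the previous modality's embeddings and keeping the sample nearest each centroid; the buffer size and the use of scikit-learn are assumptions for illustration.

```python
# Minimal sketch of k-means sampling for a rehearsal buffer: cluster feature
# embeddings of the previous modality and keep the sample nearest each centroid,
# so the buffer covers that modality's feature space.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_rehearsal_indices(embeddings, buffer_size):
    """Return indices of up to `buffer_size` samples, one nearest each centroid."""
    km = KMeans(n_clusters=buffer_size, n_init=10, random_state=0).fit(embeddings)
    chosen = []
    for center in km.cluster_centers_:
        dists = np.linalg.norm(embeddings - center, axis=1)
        chosen.append(int(np.argmin(dists)))
    return sorted(set(chosen))

embeddings = np.random.randn(2000, 256)             # features of the previous modality
buffer_idx = kmeans_rehearsal_indices(embeddings, buffer_size=64)
print(len(buffer_idx))                               # samples to rehearse later
```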
Recent advances in deep reinforcement learning have showcased its potential in tackling complex tasks. However, experiments on visual control tasks have revealed that state-of-the-art reinforcement learning models struggle with out-of-distribution generalization. Conversely, expressing higher-level concepts and global contexts is relatively easy using language. Building upon the recent success of large language models, our main objective is to improve the state abstraction technique in reinforcement learning by leveraging language for robust action selection. Specifically, we focus on learning language-grounded visual features to enhance world model learning, a model-based reinforcement learning technique. To enforce our hypothesis explicitly, we mask out the bounding boxes of a few objects in the image observation and provide the text prompt as descriptions for these masked objects. Subsequently, we predict the masked objects along with the surrounding regions as pixel reconstruction, similar to the transformer-based masked autoencoder approach. Our proposed LanGWM: Language Grounded World Model achieves state-of-the-art performance on the out-of-distribution test at the 100K-interaction-step benchmark of the iGibson point navigation tasks. Furthermore, our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction, because our extracted visual features are language grounded.
https://arxiv.org/abs/2311.17593
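The masking step described in the abstract can be sketched as zeroing out a few object bounding boxes in the image observation and pairing the result with a text prompt describing those objects; the box coordinates and prompt below are placeholders.

```python
# Minimal sketch of the masking step described: zero out a few object bounding
# boxes in the image observation and pair the observation with a text prompt
# describing the masked objects. Box coordinates and prompt text are placeholders.
import torch

def mask_objects(image, boxes):
    """image: (C, H, W); boxes: list of (x1, y1, x2, y2) pixel coords to mask."""
    masked = image.clone()
    for x1, y1, x2, y2 in boxes:
        masked[:, y1:y2, x1:x2] = 0.0
    return masked

obs = torch.rand(3, 128, 128)                            # hypothetical RGB observation
boxes = [(20, 30, 60, 80), (90, 10, 120, 50)]            # hypothetical object boxes
prompt = "a red chair near the doorway; a wooden table"  # description of masked objects
masked_obs = mask_objects(obs, boxes)
# The world model would then reconstruct the masked pixels conditioned on `prompt`.
```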
Image segmentation based on continual learning exhibits a critical drop in performance, mainly due to catastrophic forgetting and background shift, as models are required to incorporate new classes continually. In this paper, we propose a simple yet effective Continual Image Segmentation method with incremental Dynamic Query (CISDQ), which decouples the representation learning of old and new knowledge with lightweight query embeddings. CISDQ makes three main contributions: 1) We define dynamic queries with an adaptive background class to exploit past knowledge and learn future classes naturally. 2) CISDQ proposes a class/instance-aware Query Guided Knowledge Distillation strategy to overcome catastrophic forgetting by capturing the inter-class diversity and intra-class identity. 3) Apart from semantic segmentation, CISDQ introduces continual learning for instance segmentation, in which instance-wise labeling and supervision are considered. Extensive experiments on three datasets for two tasks (i.e., continual semantic and instance segmentation) are conducted to demonstrate that CISDQ achieves state-of-the-art performance, specifically obtaining 4.4% and 2.9% mIoU improvements for the ADE 100-10 (6 steps) setting and the ADE 100-5 (11 steps) setting, respectively.
https://arxiv.org/abs/2311.17450
Converging Zero Trust (ZT) with learning techniques can solve various operational and security challenges in Distributed Computing Continuum Systems (DCCS). Implementing centralized ZT architecture is seen as unsuitable for the computing continuum (e.g., computing entities with limited connectivity and visibility, etc.). At the same time, implementing decentralized ZT in the computing continuum requires understanding infrastructure limitations and novel approaches to enhance resource access management decisions. To overcome such challenges, we present a novel learning-driven ZT conceptual architecture designed for DCCS. We aim to enhance ZT architecture service quality by incorporating lightweight learning strategies such as Representation Learning (ReL) and distributing ZT components across the computing continuum. The ReL helps to improve the decision-making process by predicting threats or untrusted requests. Through an illustrative example, we show how the learning process detects and blocks the requests, enhances resource access control, and reduces network and computation overheads. Lastly, we discuss the conceptual architecture, processes, and provide a research agenda.
https://arxiv.org/abs/2311.17447
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. While traditional methods heavily rely on representation learning trained on extensive video data, there exists a significant limitation: obtaining effective video representations proves challenging due to the inherent complexity and variability in human activities. Furthermore, exclusive dependence on video-based learning may constrain a model's capability to generalize across long-tail classes and out-of-distribution scenarios. In this study, we introduce a novel approach for long-term action anticipation using language models (LALM), adept at addressing the complex challenges of long-term activity understanding without the need for extensive training. Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details. By leveraging the context provided by these past events, we devise a prompting strategy for action anticipation using large language models (LLMs). Moreover, we implement Maximal Marginal Relevance for example selection to facilitate in-context learning of the LLMs. Our experimental results demonstrate that LALM surpasses the state-of-the-art methods in the task of long-term action anticipation on the Ego4D benchmark. We further validate LALM on two additional benchmarks, affirming its capacity for generalization across intricate activities with different sets of taxonomies. These results are achieved without specific fine-tuning.
https://arxiv.org/abs/2311.17944
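Maximal Marginal Relevance for in-context example selection trades off relevance to the query against redundancy with already-chosen examples; a minimal sketch follows, with embeddings and the lambda weight as placeholders.

```python
# Minimal sketch of Maximal Marginal Relevance (MMR) for picking in-context
# examples: balance similarity to the query against redundancy with examples
# already selected.
import numpy as np

def mmr_select(query_emb, candidate_embs, k=3, lam=0.7):
    """Return indices of k candidates chosen by MMR (cosine similarity)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    relevance = [cos(query_emb, c) for c in candidate_embs]
    selected = []
    while len(selected) < min(k, len(candidate_embs)):
        best_i, best_score = None, -np.inf
        for i, c in enumerate(candidate_embs):
            if i in selected:
                continue
            redundancy = max((cos(c, candidate_embs[j]) for j in selected), default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected

query = np.random.randn(384)
candidates = np.random.randn(20, 384)               # embeddings of past action examples
print(mmr_select(query, candidates, k=4))
```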