The meanings and relationships of words shift over time. This phenomenon is referred to as semantic shift. Research focused on understanding how semantic shifts occur over multiple time periods is essential for gaining a detailed understanding of semantic shifts. However, detecting change points only between adjacent time periods is insufficient for analyzing detailed semantic shifts, and using BERT-based methods to examine word sense proportions incurs a high computational cost. To address these issues, we propose a simple yet intuitive framework for analyzing how semantic shifts occur over multiple time periods by leveraging a similarity matrix between the embeddings of the same word through time. We compute a diachronic word similarity matrix using fast and lightweight word embeddings across arbitrary time periods, enabling a deeper analysis of continuous semantic shifts. Additionally, by clustering the similarity matrices for different words, we can categorize words that exhibit similar patterns of semantic shift in an unsupervised manner.
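For intuition, here is a minimal sketch of the similarity-matrix idea (not the authors' code): cosine similarities between one word's per-period embeddings form a T×T matrix, and the matrices of many words are then clustered. The embeddings, period count, and cluster count below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def diachronic_similarity_matrix(embs):
    """Cosine similarity between embeddings of the same word across T periods.

    embs: (T, d) array, row t = the word's embedding trained on period t.
    Returns a (T, T) similarity matrix.
    """
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return normed @ normed.T

# Hypothetical data: 50 words, 10 time periods, 100-dim embeddings.
rng = np.random.default_rng(0)
per_word_embs = rng.normal(size=(50, 10, 100))

# One similarity matrix per word, flattened into a feature vector.
mats = np.stack([diachronic_similarity_matrix(e) for e in per_word_embs])
features = mats.reshape(len(mats), -1)

# Cluster words whose similarity matrices (shift patterns) look alike.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(labels[:10])
```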
https://arxiv.org/abs/2501.09538
Training deep neural networks requires datasets with a large number of annotated examples. The collection and annotation of these datasets are not only extremely expensive but also face legal and privacy problems. These factors are a significant limitation for many real-world applications. To address this, we introduce HydraMix, a novel architecture that generates new image compositions by mixing multiple different images from the same class. HydraMix learns the fusion of the content of various images guided by a segmentation-based mixing mask in feature space and is optimized via a combination of unsupervised and adversarial training. Our data augmentation scheme allows the creation of models trained from scratch on very small datasets. We conduct extensive experiments on ciFAIR-10, STL-10, and ciFAIR-100. Additionally, we introduce a novel text-image metric to assess the generality of the augmented datasets. Our results show that HydraMix outperforms existing state-of-the-art methods for image classification on small datasets.
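As a loose illustration only: the sketch below blends several same-class images under a normalized per-pixel mask. HydraMix itself learns segmentation-based masks and fuses content in feature space with adversarial training, so treat all shapes and names here as hypothetical.

```python
import torch

def mix_same_class(images: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """Blend k same-class images with a per-pixel soft assignment mask.

    images: (k, C, H, W) images from one class.
    logits: (k, H, W) unnormalized mask scores (learned in the real method).
    """
    weights = torch.softmax(logits, dim=0)          # (k, H, W), sums to 1 per pixel
    return (weights.unsqueeze(1) * images).sum(0)   # (C, H, W) composition

# Illustrative usage with random tensors standing in for three same-class images.
imgs = torch.rand(3, 3, 32, 32)
mask_logits = torch.randn(3, 32, 32)
mixed = mix_same_class(imgs, mask_logits)
print(mixed.shape)  # torch.Size([3, 32, 32])
```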
https://arxiv.org/abs/2501.09504
Unsupervised visual defect detection is critical in industrial applications, requiring a representation space that captures normal data features while detecting deviations. Achieving a balance between expressiveness and compactness is challenging; an overly expressive space risks inefficiency and mode collapse, impairing detection accuracy. We propose a novel approach using an enhanced VQ-VAE framework optimized for unsupervised defect detection. Our model introduces a patch-aware dynamic code assignment scheme, enabling context-sensitive code allocation to optimize spatial representation. This strategy enhances normal-defect distinction and improves detection accuracy during inference. Experiments on MVTecAD, BTAD, and MTSD datasets show our method achieves state-of-the-art performance.
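For background, here is a minimal sketch of plain per-patch vector quantization, the mechanism on which the paper's patch-aware dynamic code assignment builds; the context-sensitive allocation itself is not reproduced, and all dimensions are illustrative.

```python
import torch

def quantize_patches(z: torch.Tensor, codebook: torch.Tensor):
    """Assign each spatial position of an encoder feature map to its nearest code.

    z: (B, D, H, W) encoder output; codebook: (K, D) embedding vectors.
    Returns quantized features and code indices. (The paper's patch-aware
    *dynamic* assignment is context-sensitive; this is the plain VQ baseline.)
    """
    B, D, H, W = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, D)            # (B*H*W, D)
    dists = torch.cdist(flat, codebook)                    # (B*H*W, K)
    idx = dists.argmin(dim=1)
    zq = codebook[idx].view(B, H, W, D).permute(0, 3, 1, 2)
    return zq, idx.view(B, H, W)

z = torch.randn(2, 64, 8, 8)
codebook = torch.randn(512, 64)
zq, idx = quantize_patches(z, codebook)
print(zq.shape, idx.shape)
```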
https://arxiv.org/abs/2501.09187
Dynamic MRI reconstruction, a classic inverse problem, has seen a surge of progress through the use of deep learning techniques. In particular, the practical difficulty of obtaining ground truth data has led to the emergence of unsupervised learning approaches. A recent promising method among them is implicit neural representation (INR), which defines the data as a continuous function that maps coordinate values to the corresponding signal values. This allows for filling in missing information with only incomplete measurements and solving the inverse problem effectively. Nevertheless, previous works incorporating this method have faced drawbacks such as long optimization time and the need for extensive hyperparameter tuning. To address these issues, we propose Dynamic-Aware INR (DA-INR), an INR-based model for dynamic MRI reconstruction that captures the spatial and temporal continuity of dynamic MRI data in the image domain and explicitly incorporates the temporal redundancy of the data into the model structure. As a result, DA-INR outperforms other models in reconstruction quality even at extreme undersampling ratios while significantly reducing optimization time and requiring minimal hyperparameter tuning.
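A minimal sketch of the INR idea, assuming a plain coordinate MLP (DA-INR's dynamic-aware structure and temporal-redundancy modeling are not reproduced here):

```python
import torch
import torch.nn as nn

class CoordinateINR(nn.Module):
    """Minimal implicit neural representation: (x, y, t) -> signal value.

    Fitting it to undersampled measurements fills in the missing data;
    DA-INR additionally builds temporal redundancy into the architecture.
    """
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):          # coords: (N, 3) in [-1, 1]
        return self.net(coords)

model = CoordinateINR()
coords = torch.rand(1024, 3) * 2 - 1    # random spatio-temporal sample points
pred = model(coords)                    # predicted signal at those coordinates
print(pred.shape)
```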
https://arxiv.org/abs/2501.09049
Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at this https URL.
https://arxiv.org/abs/2501.08712
This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. On the other hand, previous memory-bank-based contrastive methods may lead to data inconsistency due to the limitation of batch size. Furthermore, existing pseudo-label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples, leading to limited model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noise to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at \href{this https URL}{this https URL}.
https://arxiv.org/abs/2501.09044
Semantic segmentation is essential for comprehending images, but the process necessitates a substantial amount of detailed annotations at the pixel level. Acquiring such annotations can be costly in the real world. Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses labeled virtual data to train a model and adapts it to unlabeled real data. Some recent works use contrastive learning, a powerful method for self-supervised learning, to aid this technique. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo-labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at this https URL
https://arxiv.org/abs/2501.09040
Learning concepts from natural high-dimensional data (e.g., images) holds potential in building human-aligned and interpretable machine learning models. Despite its encouraging prospect, formalization and theoretical insights into this crucial task are still lacking. In this work, we formalize concepts as discrete latent causal variables that are related via a hierarchical causal model that encodes different abstraction levels of concepts embedded in high-dimensional data (e.g., a dog breed and its eye shapes in natural images). We formulate conditions to facilitate the identification of the proposed causal model, which reveals when learning such concepts from unsupervised data is possible. Our conditions permit complex causal hierarchical structures beyond latent trees and multi-level directed acyclic graphs in prior work and can handle high-dimensional, continuous observed variables, which is well-suited for unstructured data modalities such as images. We substantiate our theoretical claims with synthetic data experiments. Further, we discuss our theory's implications for understanding the underlying mechanisms of latent diffusion models and provide corresponding empirical evidence for our theoretical insights.
https://arxiv.org/abs/2406.00519
RGB-based 3D pose estimation methods have been successful with the development of deep learning and the emergence of high-quality 3D pose datasets. However, most existing methods do not operate well on test images whose distribution is far from that of the training data. This problem might be alleviated by involving diverse data during training; however, it is non-trivial to collect such diverse data with corresponding labels (i.e., 3D pose). In this paper, we introduce an unsupervised domain adaptation framework for 3D pose estimation that utilizes unlabeled data in addition to labeled data via a masked image modeling (MIM) framework. Foreground-centric reconstruction and attention regularization are further proposed to increase the effectiveness of unlabeled data usage. Experiments are conducted on various datasets for human and hand pose estimation tasks, especially in cross-domain scenarios. We demonstrate the effectiveness of our method by achieving state-of-the-art accuracy on all datasets.
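For intuition, here is a simplified masked-image-modeling objective of the kind the framework builds on (not the paper's foreground-centric variant); the stand-in network and mask ratio are placeholders.

```python
import torch

def mim_loss(model, images: torch.Tensor, mask_ratio: float = 0.6, patch: int = 16):
    """Masked-image-modeling objective on unlabeled images (simplified).

    Random patches are zeroed out; the model reconstructs the full image and
    is penalized only on the masked regions.
    """
    B, C, H, W = images.shape
    ph, pw = H // patch, W // patch
    keep = torch.rand(B, 1, ph, pw, device=images.device) > mask_ratio
    mask = keep.float().repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    recon = model(images * mask)                      # reconstruct from visible patches
    return ((recon - images) ** 2 * (1 - mask)).mean()

# Illustrative stand-in for the reconstruction network.
net = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
loss = mim_loss(net, torch.rand(4, 3, 64, 64))
loss.backward()
print(float(loss))
```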
https://arxiv.org/abs/2501.08408
Anomaly detection (AD) plays a pivotal role in AI applications, e.g., in classification, and intrusion/threat detection in cybersecurity. However, most existing methods face challenges from the heterogeneity amongst feature subsets posed by non-independent and identically distributed (non-IID) data. We propose a novel neural network model called Multiple-Input Auto-Encoder for AD (MIAEAD) to address this. MIAEAD assigns an anomaly score to each feature subset of a data sample to indicate its likelihood of being an anomaly, using the reconstruction error of the corresponding sub-encoder as the anomaly score. All sub-encoders are then simultaneously trained using unsupervised learning to determine the anomaly scores of feature subsets. The final AUC of MIAEAD is calculated for each sub-dataset, and the maximum AUC obtained among the sub-datasets is selected. To leverage generative models' ability to model the distribution of normal data for identifying anomalies, we develop a novel neural network architecture/model called Multiple-Input Variational Auto-Encoder (MIVAE). MIVAE processes feature subsets through its sub-encoders before learning the distribution of normal data in the latent space, which allows it to identify anomalies that deviate from the learned distribution. We theoretically prove that the difference in the average anomaly score between normal samples and anomalies obtained by the proposed MIVAE is greater than that of the Variational Auto-Encoder (VAEAD), resulting in a higher AUC for MIVAE. Extensive experiments on eight real-world anomaly datasets demonstrate the superior performance of MIAEAD and MIVAE over conventional methods and state-of-the-art unsupervised models, by up to 6% in terms of AUC score. Moreover, MIAEAD and MIVAE maintain a high AUC when applied to feature subsets with low heterogeneity, as measured by the coefficient of variation (CV) score.
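A much-simplified sketch of the MIAEAD idea under assumed subset dimensions: one sub-encoder per feature subset, with each subset's reconstruction error serving as its anomaly score.

```python
import torch
import torch.nn as nn

class SubAE(nn.Module):
    """Tiny autoencoder for one feature subset."""
    def __init__(self, d_in, d_hid=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU())
        self.dec = nn.Linear(d_hid, d_in)

    def forward(self, x):
        return self.dec(self.enc(x))

class MultiInputAE(nn.Module):
    """One sub-autoencoder per feature subset; each subset's reconstruction
    error serves as that subset's anomaly score (the MIAEAD idea, simplified)."""
    def __init__(self, subset_dims):
        super().__init__()
        self.aes = nn.ModuleList(SubAE(d) for d in subset_dims)

    def scores(self, subsets):
        # subsets: list of (B, d_i) tensors, one per feature subset
        return [((ae(x) - x) ** 2).mean(dim=1) for ae, x in zip(self.aes, subsets)]

model = MultiInputAE([10, 5, 7])
batch = [torch.rand(32, d) for d in (10, 5, 7)]
print([s.shape for s in model.scores(batch)])   # per-subset anomaly scores
```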
https://arxiv.org/abs/2501.08149
Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Absolute robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised, and model-agnostic method that unifies detection of all kinds of shifts: Find a full model of the training data's feature distribution, to then use its density at new points as in-distribution (ID) score. To implement this, we propose to combine the newly available Vision Foundation Models (VFM) as feature extractors with one of four alternative density modeling techniques. In an extensive benchmark of 4 VFMs against 20 baselines, we show the superior performance of VFM feature encodings compared to shift-specific OOD monitors. Additionally, we find that sophisticated architectures outperform larger latent space dimensionality; and our method identifies samples with higher risk of errors on downstream tasks, despite being model-agnostic. This suggests that VFMs are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
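A minimal sketch of the framework's core recipe, assuming a Gaussian mixture as the density model and random arrays standing in for VFM feature encodings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for vision-foundation-model embeddings of training images.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(2000, 128))       # in practice: VFM encoder outputs

# Fit a full density model of the training feature distribution.
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(train_feats)

# At operation time, the log-density at a new point is its in-distribution score;
# low scores flag OOD samples regardless of the kind of shift.
new_feats = rng.normal(loc=3.0, size=(5, 128))   # shifted, i.e., OOD-like
id_scores = gmm.score_samples(new_feats)
threshold = np.quantile(gmm.score_samples(train_feats), 0.05)
print(id_scores < threshold)                      # True -> flagged as OOD
```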
https://arxiv.org/abs/2501.08083
In this tutorial, we explore Variational Autoencoders (VAEs), an essential framework for unsupervised learning, particularly suited for high-dimensional datasets such as neuroimaging. By integrating deep learning with Bayesian inference, VAEs enable the generation of interpretable latent representations. This tutorial outlines the theoretical foundations of VAEs, addresses practical challenges such as convergence issues and over-fitting, and discusses strategies like the reparameterization trick and hyperparameter optimization. We also highlight key applications of VAEs in neuroimaging, demonstrating their potential to uncover meaningful patterns, including those associated with neurodegenerative processes, and their broader implications for analyzing complex brain data.
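The reparameterization trick mentioned above, in a minimal self-contained form (the reconstruction term here is a stand-in):

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: sample z = mu + sigma * eps with eps ~ N(0, I),
    so gradients flow through mu and logvar despite the sampling step."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def kl_divergence(mu, logvar):
    """KL(q(z|x) || N(0, I)), the regularizer in the VAE objective."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()

mu = torch.zeros(16, 32, requires_grad=True)
logvar = torch.zeros(16, 32, requires_grad=True)
z = reparameterize(mu, logvar)
loss = kl_divergence(mu, logvar) + z.pow(2).mean()  # stand-in reconstruction term
loss.backward()                                      # gradients reach mu and logvar
print(mu.grad is not None)
```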
https://arxiv.org/abs/2501.08009
Out-of-distribution (OOD) detection holds significant importance across many applications. While semantic and domain-shift OOD problems are well-studied, this work focuses on covariate shifts - subtle variations in the data distribution that can degrade machine learning performance. We hypothesize that detecting these subtle shifts can improve our understanding of in-distribution boundaries, ultimately improving OOD detection. In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics - a property we exploit for OOD detection. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder (VAE) framework that harnesses this mechanism. During inference, batches consist of patches from the same image, ensuring a consistent data distribution that allows the model to rely on batch statistics. DisCoPatch uses the VAE's suboptimal outputs (generated and reconstructed) as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this boundary, DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 95.5% AUROC on ImageNet-1K(-C), but also outperforms all prior methods on public Near-OOD (95.0%) benchmarks. With a compact model size of 25MB, it achieves high OOD detection performance at notably lower latency than existing methods, making it an efficient and practical solution for real-world OOD detection applications. The code will be made publicly available.
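For intuition, here is a small sketch of the patch-batch construction described above, assuming non-overlapping square patches; sizes are illustrative.

```python
import torch

def image_to_patch_batch(img: torch.Tensor, patch: int = 64) -> torch.Tensor:
    """Tile one image into a batch of non-overlapping patches, so every batch
    fed to the BN-based discriminator shares a single image's statistics."""
    C, H, W = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, nH, nW, p, p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, C, patch, patch)

batch = image_to_patch_batch(torch.rand(3, 256, 256))
print(batch.shape)  # torch.Size([16, 3, 64, 64])
```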
https://arxiv.org/abs/2501.08005
Enhancing the precision of segmenting coronary atherosclerotic plaques from CT Angiography (CTA) images is pivotal for advanced Coronary Atherosclerosis Analysis (CAA), which distinctively relies on the analysis of vessel cross-section images reconstructed via Curved Planar Reformation. This task presents significant challenges due to the indistinct boundaries and structures of plaques and blood vessels, leading to the inadequate performance of current deep learning models, compounded by the inherent difficulty in annotating such complex data. To address these issues, we propose a novel dual-consistency semi-supervised framework that integrates Intra-frame Topological Consistency (ITC) and Cross-frame Topological Consistency (CTC) to leverage labeled and unlabeled data. ITC employs a dual-task network for simultaneous segmentation mask and Skeleton-aware Distance Transform (SDT) prediction, achieving similar prediction of topology structure through consistency constraint without additional annotations. Meanwhile, CTC utilizes an unsupervised estimator for analyzing pixel flow between skeletons and boundaries of adjacent frames, ensuring spatial continuity. Experiments on two CTA datasets show that our method surpasses existing semi-supervised methods and approaches the performance of supervised methods on CAA. In addition, our method also performs better than other methods on the ACDC dataset, demonstrating its generalization.
https://arxiv.org/abs/2501.07850
In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects by integrating them within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics and information embedded within videos, a temporal transformer module is introduced, facilitating efficacious inter-frame interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to optimally exploit the derived features, aiming to generate increasingly precise segmentation masks. As a result, MTNet provides a strong and compact framework that explores both temporal and cross-modality knowledge to robustly localize and track the primary object accurately in various challenging scenarios efficiently. Extensive experiments across diverse benchmarks conclusively show that our method not only attains state-of-the-art performance in unsupervised video object segmentation but also delivers competitive results in video salient object detection. These findings highlight the method's robust versatility and its adeptness in adapting to a range of segmentation tasks. Source code is available on this https URL.
https://arxiv.org/abs/2501.07806
The increasing level of autonomy of robots poses challenges of trust and social acceptance, especially in human-robot interaction scenarios. This requires an interpretable implementation of robotic cognitive capabilities, possibly based on formal methods such as logics for the definition of task specifications. However, prior knowledge is often unavailable in complex realistic scenarios. In this paper, we propose an offline algorithm based on inductive logic programming from noisy examples to extract task specifications (i.e., action preconditions, constraints, and effects) directly from raw data of a few heterogeneous (i.e., not repetitive) robotic executions. Our algorithm leverages the output of any unsupervised action identification algorithm applied to video-kinematic recordings. Combining it with the definition of very basic, almost task-agnostic, commonsense concepts about the environment, which contribute to the interpretability of our methodology, we are able to learn logical axioms encoding preconditions of actions, as well as their effects, in the event calculus paradigm. Since the quality of learned specifications depends mainly on the accuracy of the action identification algorithm, we also propose an online framework for incremental refinement of task knowledge from user feedback, guaranteeing safe execution. Results on a standard manipulation task and on a benchmark for user training in the safety-critical surgical robotic scenario show the robustness, data-efficiency, and time-efficiency of our methodology, with promising results towards scalability to more complex domains.
https://arxiv.org/abs/2501.07507
Circadian rhythms regulate the physiology and behavior of humans and animals. Despite advancements in understanding these rhythms and predicting circadian phases at the transcriptional level, predicting circadian phases from proteomic data remains elusive. This challenge is largely due to the scarcity of time labels in proteomic datasets, which are often characterized by small sample sizes, high dimensionality, and significant noise. Furthermore, existing methods for predicting circadian phases from transcriptomic data typically rely on prior knowledge of known rhythmic genes, making them unsuitable for proteomic datasets. To address this gap, we developed a novel computational method using unsupervised deep learning techniques to predict circadian sample phases from proteomic data without requiring time labels or prior knowledge of proteins or genes. Our model involves a two-stage training process optimized for robust circadian phase prediction: an initial greedy one-layer-at-a-time pre-training which generates informative initial parameters followed by fine-tuning. During fine-tuning, a specialized loss function guides the model to align protein expression levels with circadian patterns, enabling it to accurately capture the underlying rhythmic structure within the data. We tested our method on both time-labeled and unlabeled proteomic data. For labeled data, we compared our predictions to the known time labels, achieving high accuracy, while for unlabeled human datasets, including postmortem brain regions and urine samples, we explored circadian disruptions. Notably, our analysis identified disruptions in rhythmic proteins between Alzheimer's disease and control subjects across these samples.
https://arxiv.org/abs/2501.07405
Acquiring face images of sufficiently high quality is important for online ID and travel document issuance applications using face recognition systems (FRS). Low-quality, manipulated (intentionally or unintentionally), or distorted images degrade the FRS performance and facilitate documents' misuse. Securing quality for enrolment images, especially in the unsupervised self-enrolment scenario via a smartphone, becomes important to assure FRS performance. In this work, we focus on the less studied area of radial distortion (a.k.a., the fish-eye effect) in face images and its impact on FRS performance. We introduce an effective radial distortion detection model that can detect and flag radial distortion in the enrolment scenario. We formalize the detection model as a face image quality assessment (FIQA) algorithm and provide a careful inspection of the effect of radial distortion on FRS performance. Evaluation results show excellent detection results for the proposed models, and the study on the impact on FRS uncovers valuable insights into how to best use these models in operational systems.
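Since no code accompanies this digest entry, the following is only a generic one-parameter radial distortion model (a hypothetical helper with nearest-neighbor resampling) of the kind one might use to synthesize distorted enrolment images for training such a detector:

```python
import numpy as np

def apply_radial_distortion(img: np.ndarray, k1: float) -> np.ndarray:
    """Apply a simple one-parameter radial (fish-eye-like) distortion.

    Useful for synthesizing training data for a distortion detector when
    genuinely distorted enrolment images are scarce.
    """
    h, w = img.shape[:2]
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    xn, yn = (x - w / 2) / (w / 2), (y - h / 2) / (h / 2)   # normalize to [-1, 1]
    r2 = xn ** 2 + yn ** 2
    xs = xn * (1 + k1 * r2)                                  # distorted source coords
    ys = yn * (1 + k1 * r2)
    xi = np.clip((xs * w / 2 + w / 2).round().astype(int), 0, w - 1)
    yi = np.clip((ys * h / 2 + h / 2).round().astype(int), 0, h - 1)
    return img[yi, xi]

face = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)  # stand-in image
distorted = apply_radial_distortion(face, k1=0.4)
print(distorted.shape)
```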
https://arxiv.org/abs/2501.07179
Superpixel segmentation is a foundation for many higher-level computer vision tasks, such as image segmentation, object recognition, and scene understanding. Existing graph-based superpixel segmentation methods typically concentrate on the relationships between a given pixel and its directly adjacent pixels while overlooking the influence of non-adjacent pixels. These approaches do not fully leverage the global information in the graph, leading to suboptimal segmentation quality. To address this limitation, we present SIT-HSS, a hierarchical superpixel segmentation method based on structural information theory. Specifically, we first design a novel graph construction strategy that incrementally explores the pixel neighborhood to add edges based on 1-dimensional structural entropy (1D SE). This strategy maximizes the retention of graph information while avoiding an overly complex graph structure. Then, we design a new 2D SE-guided hierarchical graph partitioning method, which iteratively merges pixel clusters layer by layer to reduce the graph's 2D SE until a predefined segmentation scale is achieved. Experimental results on three benchmark datasets demonstrate that the SIT-HSS performs better than state-of-the-art unsupervised superpixel segmentation algorithms. The source code is available at \url{this https URL}.
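For reference, here is the one-dimensional structural entropy that drives the graph-construction strategy, in a minimal form; the toy graph is illustrative, and the incremental edge-adding policy itself is not the authors' implementation.

```python
import numpy as np

def one_dim_structural_entropy(adj: np.ndarray) -> float:
    """One-dimensional structural entropy of a weighted undirected graph:
    H1(G) = -sum_i (d_i / 2m) * log2(d_i / 2m),
    where d_i is the (weighted) degree of node i and 2m the total degree."""
    deg = adj.sum(axis=1)
    vol = deg.sum()                      # 2m for an undirected graph
    p = deg[deg > 0] / vol
    return float(-(p * np.log2(p)).sum())

# Toy 4-node pixel graph; SIT-HSS incrementally explores pixel neighborhoods
# and adds edges guided by how they change this quantity.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(one_dim_structural_entropy(adj))
```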
https://arxiv.org/abs/2501.07069
The automatic identification of Magnetic Resonance Imaging (MRI) sequences can streamline clinical workflows by reducing the time radiologists spend manually sorting and identifying sequences, thereby enabling faster diagnosis and treatment planning for patients. However, the lack of standardization in the parameters of MRI scans poses challenges for automated systems and complicates the generation and utilization of datasets for machine learning research. To address this issue, we propose a system for MRI sequence identification using an unsupervised contrastive deep learning framework. By training a convolutional neural network based on the ResNet-18 architecture, our system classifies nine common MRI sequence types as a 9-class classification problem. The network was trained using an in-house internal dataset and validated on several public datasets, including BraTS, ADNI, Fused Radiology-Pathology Prostate Dataset, the Breast Cancer Dataset (ACRIN), among others, encompassing diverse acquisition protocols and requiring only 2D slices for training. Our system achieves a classification accuracy of over 0.95 across the nine most common MRI sequence types.
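A minimal sketch of a standard contrastive objective (NT-Xent) of the kind such unsupervised frameworks typically use, with random projections standing in for ResNet-18 outputs; the paper's exact loss is not specified in the abstract:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent contrastive loss over two augmented views of the same slices.

    z1, z2: (B, D) projections of two views; positives are matching rows.
    """
    z = F.normalize(torch.cat([z1, z2]), dim=1)            # (2B, D)
    sim = z @ z.T / tau
    sim.fill_diagonal_(float("-inf"))                      # exclude self-pairs
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)          # stand-in projections
print(float(nt_xent(z1, z2)))
```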
https://arxiv.org/abs/2501.06938