Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.
https://arxiv.org/abs/2404.16818
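The alternating procedure described in the abstract above (assign mask features to class prototypes, then re-fit the prototypes) can be sketched as follows. This is an illustrative numpy toy, not the authors' PriMaPs-EM: the cosine-similarity E-step, the normalized-mean M-step, and all names are assumptions about one plausible form of the idea, and the stochastic, per-image updating of the actual method is omitted.

```python
import numpy as np

def primaps_em(mask_feats, n_classes, n_iters=20, seed=0):
    """Toy EM-style loop: fit class prototypes to mask-level features.

    mask_feats: (N, D) array, one feature vector per mask proposal.
    Returns L2-normalized prototypes (n_classes, D) and hard assignments (N,).
    Illustrative sketch only -- not the authors' PriMaPs-EM.
    """
    rng = np.random.default_rng(seed)
    feats = mask_feats / np.linalg.norm(mask_feats, axis=1, keepdims=True)
    # Initialize prototypes from randomly chosen mask features.
    protos = feats[rng.choice(len(feats), n_classes, replace=False)]
    assign = np.zeros(len(feats), dtype=int)
    for _ in range(n_iters):
        # E-step: assign each mask to its most similar prototype (cosine).
        assign = (feats @ protos.T).argmax(axis=1)
        # M-step: each prototype becomes the normalized mean of its members.
        for k in range(n_classes):
            members = feats[assign == k]
            if len(members):
                mean = members.mean(axis=0)
                protos[k] = mean / np.linalg.norm(mean)
    return protos, assign
```

On well-separated feature clusters the loop typically stabilizes after a few iterations, with one prototype per semantic group.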
Modeling non-stationary data is a challenging problem in the field of continual learning, and data distribution shifts may have negative consequences for the performance of a machine learning model. Classic learning tools are often vulnerable to perturbations of the input covariates, sensitive to outliers and noise, and sometimes built on rigid algebraic assumptions. Distribution shifts frequently occur due to changes in raw materials for production, seasonality, a different user base, or even adversarial attacks. Therefore, there is a need for more effective distribution shift detection techniques. In this work, we propose a continual learning framework for monitoring and detecting distribution changes. We explore the problem in a latent space generated by bio-inspired self-organizing clustering, together with statistical aspects of that latent space. In particular, we investigate the projections made by two topology-preserving maps: the Self-Organizing Map and the Scale Invariant Map. Our method can be applied in both a supervised and an unsupervised context. We cast the assessment of changes in the data distribution as a comparison of Gaussian signals, making the proposed method fast and robust. We compare it to other unsupervised techniques, specifically Principal Component Analysis (PCA) and Kernel-PCA. Our comparison involves experiments on sequences of images (based on MNIST, with shifts injected via adversarial samples), chemical sensor measurements, and an environmental variable related to ozone levels. The empirical study reveals the potential of the proposed approach.
https://arxiv.org/abs/2404.16656
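The core statistical step above, comparing data in a latent space as Gaussian signals, can be sketched as a per-dimension test on the difference of means. This simplification (a plain z-test on latent coordinates, with hypothetical names and threshold) stands in for the paper's SOM-based pipeline:

```python
import numpy as np

def detect_shift(reference, window, z_thresh=4.0):
    """Flag a distribution shift by comparing two samples as Gaussian signals.

    reference, window: (N, D) arrays of latent projections (e.g. outputs of a
    topology-preserving map). Returns True if any latent dimension's mean
    shifts by more than z_thresh standard errors. Simplified sketch, not the
    paper's method.
    """
    mu_r, mu_w = reference.mean(0), window.mean(0)
    # Standard error of the difference of means, per dimension.
    se = np.sqrt(reference.var(0, ddof=1) / len(reference)
                 + window.var(0, ddof=1) / len(window))
    z = np.abs(mu_r - mu_w) / np.maximum(se, 1e-12)
    return bool((z > z_thresh).any())
```

Because the comparison reduces to a handful of means and variances, it stays cheap even when monitoring long data streams.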
Unsupervised cross-lingual transfer involves transferring knowledge between languages without explicit supervision. Although numerous studies have been conducted to improve performance in such tasks by focusing on cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches are limited in that they incorporate only syntactic or only lexical information. Since each type of information offers unique advantages and no previous attempts have combined both, we attempt to explore the potential of this approach. In this paper, we present a novel framework called "Lexicon-Syntax Enhanced Multilingual BERT" that combines both lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capabilities. The code-switching technique is used to implicitly teach the model lexical alignment information, while a syntax-based graph attention network is designed to help the model encode syntactic structure. To integrate both types of knowledge, we feed code-switched sequences into both the syntactic module and the mBERT base model simultaneously. Our extensive experimental results demonstrate that this framework consistently outperforms all baselines for zero-shot cross-lingual transfer, with gains of 1.0-3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks. Keywords: cross-lingual transfer, lexicon, syntax, code-switching, graph attention network
https://arxiv.org/abs/2404.16627
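The code-switching technique mentioned above can be illustrated with a toy dictionary-based substitution. The lexicon and substitution ratio here are hypothetical; real pipelines would draw replacements from a learned or curated bilingual lexicon:

```python
import random

def code_switch(tokens, lexicon, ratio=0.3, seed=0):
    """Replace a fraction of tokens with translations from a bilingual lexicon.

    A simplified sketch of code-switching augmentation: randomly chosen
    source words are swapped for their target-language counterparts so the
    model sees mixed-language contexts. `lexicon` is illustrative.
    """
    rng = random.Random(seed)
    out = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t in lexicon]
    rng.shuffle(candidates)
    n_swap = max(1, int(len(candidates) * ratio)) if candidates else 0
    for i in candidates[:n_swap]:
        out[i] = lexicon[out[i]]
    return out

lexicon = {"cat": "Katze", "sat": "saß", "mat": "Matte"}  # toy en->de entries
switched = code_switch("the cat sat on the mat".split(), lexicon, ratio=0.5)
```

Feeding such mixed sequences to both the syntactic module and the base model is what lets the two knowledge sources interact during training.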
In this paper, we address the challenging task of source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation, given only a pinhole image pre-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is non-trivial due to three critical challenges: 1) semantic mismatches arising from the distinct Field-of-View (FoV) between domains, 2) style discrepancies inherent in the UDA problem, and 3) inevitable distortion of the panoramic images. To tackle these problems, we propose 360SFUDA++, which effectively extracts knowledge from the source pinhole model with only unlabeled panoramic images and transfers the reliable knowledge to the target panoramic domain. Specifically, we first utilize Tangent Projection (TP), as it has less distortion, and meanwhile split the equirectangular projection (ERP) into patches with fixed FoV projection (FFP) to mimic the pinhole images. Both projections are shown to be effective in extracting knowledge from the source model. However, as the distinct projections make it difficult to directly transfer knowledge between domains, we then propose the Reliable Panoramic Prototype Adaptation Module (RP2AM) to transfer knowledge at both the prediction and prototype levels. RP2AM selects the confident knowledge and integrates panoramic prototypes for reliable knowledge adaptation. Moreover, we introduce the Cross-projection Dual Attention Module (CDAM), which better aligns the spatial and channel characteristics across projections at the feature level between domains. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our 360SFUDA++ achieves significantly better performance than prior SFUDA methods.
https://arxiv.org/abs/2404.16501
The prevalent approaches of unsupervised 3D object detection follow cluster-based pseudo-label generation and iterative self-training processes. However, a challenge arises from the sparsity of LiDAR scans, which leads to pseudo-labels with erroneous size and position, resulting in subpar detection performance. To tackle this problem, this paper introduces a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD first constructs a Commonsense Prototype (CProto) characterized by a high-quality bounding box and dense points, based on commonsense intuition. Subsequently, CPD refines the low-quality pseudo-labels by leveraging the size prior from CProto. Furthermore, CPD enhances the detection accuracy of sparsely scanned objects by leveraging the geometric knowledge from CProto. CPD outperforms state-of-the-art unsupervised 3D detectors on the Waymo Open Dataset (WOD), PandaSet, and KITTI datasets by a large margin. Besides, by training CPD on WOD and testing on KITTI, CPD attains 90.85% and 81.01% 3D Average Precision on the easy and moderate car classes, respectively. These achievements position CPD in close proximity to fully supervised detectors, highlighting the significance of our method. The code will be available at this https URL.
https://arxiv.org/abs/2404.16493
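The size-prior refinement step can be pictured as pulling a pseudo-label's box dimensions toward the class prototype's dimensions. A minimal sketch under the assumption of a fixed blend weight; CPD's actual refinement is more involved than this:

```python
import numpy as np

def refine_box_size(pseudo_size, proto_size, alpha=0.7):
    """Pull a pseudo-label's box size toward a commonsense prototype size.

    pseudo_size, proto_size: (3,) arrays of (length, width, height) in meters.
    Sparse scans tend to underestimate extents, so the pseudo size is blended
    with the class prototype's size prior. The fixed blend weight `alpha` is
    an assumption for illustration, not CPD's actual rule.
    """
    pseudo = np.asarray(pseudo_size, float)
    proto = np.asarray(proto_size, float)
    return alpha * proto + (1.0 - alpha) * pseudo
```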
Unsupervised graph anomaly detection aims at identifying rare patterns that deviate from the majority in a graph without the aid of labels, which is important for a variety of real-world applications. Recent advances have utilized Graph Neural Networks (GNNs) to learn effective node representations by aggregating information from neighborhoods. This is motivated by the hypothesis that nodes in the graph tend to exhibit consistent behaviors with their neighborhoods. However, such consistency can be disrupted by graph anomalies in multiple ways. Most existing methods directly employ GNNs to learn representations, disregarding the negative impact of graph anomalies on GNNs, resulting in sub-optimal node representations and anomaly detection performance. While a few recent approaches have redesigned GNNs for graph anomaly detection under semi-supervised label guidance, how to address the adverse effects of graph anomalies on GNNs in unsupervised scenarios and learn effective representations for anomaly detection are still under-explored. To bridge this gap, in this paper, we propose a simple yet effective framework for Guarding Graph Neural Networks for Unsupervised Graph Anomaly Detection (G3AD). Specifically, G3AD introduces two auxiliary networks along with correlation constraints to guard the GNNs from inconsistent information encoding. Furthermore, G3AD introduces an adaptive caching module to guard the GNNs from solely reconstructing the observed data that contains anomalies. Extensive experiments demonstrate that our proposed G3AD can outperform seventeen state-of-the-art methods on both synthetic and real-world datasets.
https://arxiv.org/abs/2404.16366
Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to calculate the distance between each test image and the cached samples, which serves as the weight of the corresponding cached label when generating similarity-based prediction probabilities. In this way, TFUP achieves surprising performance, even surpassing training-based methods on multiple classification datasets. Building on TFUP, we propose a training-based approach (TFUP-T) to further boost the adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. Our TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset.
https://arxiv.org/abs/2404.16339
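The training-free cache prediction described above (similarity-weighted voting over cached labels) might look roughly like this. The exponential sharpening and all names are assumptions; the paper's Multi-level Similarity Measure combines feature- and semantic-level terms that are not modeled here:

```python
import numpy as np

def cache_predict(test_feat, cache_feats, cache_labels, n_classes, beta=5.0):
    """Similarity-weighted label vote from a feature cache (training-free).

    Cosine similarities to cached samples are sharpened with exp(beta * sim)
    and used to weight the cached one-hot labels. A minimal sketch of the
    cache-model idea; the weighting scheme is an assumption, not the paper's
    MSM.
    """
    t = test_feat / np.linalg.norm(test_feat)
    c = cache_feats / np.linalg.norm(cache_feats, axis=1, keepdims=True)
    sims = c @ t                                # (N,) cosine similarities
    w = np.exp(beta * (sims - sims.max()))      # numerically stable sharpening
    one_hot = np.eye(n_classes)[cache_labels]   # (N, n_classes)
    probs = w @ one_hot
    return probs / probs.sum()
```

Because no parameters are updated, adaptation quality rests entirely on how representative the cached samples are, which is why sample selection matters in the method above.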
Unsupervised Domain Adaptation (UDA) refers to the method that utilizes annotated source domain data and unlabeled target domain data to train a model capable of generalizing to the target domain data. Domain discrepancy leads to a significant decrease in the performance of general network models trained on the source domain data when applied to the target domain. We introduce a straightforward approach to mitigate the domain discrepancy, which necessitates no additional parameter calculations and seamlessly integrates with self-training-based UDA methods. Through the transfer of the target domain style to the source domain in the latent feature space, the model is trained to prioritize the target domain style during the decision-making process. We tackle the problem at both the image-level and shallow feature map level by transferring the style information from the target domain to the source domain data. As a result, we obtain a model that exhibits superior performance on the target domain. Our method yields remarkable enhancements in the state-of-the-art performance for synthetic-to-real UDA tasks. For example, our proposed method attains a noteworthy UDA performance of 76.93 mIoU on the GTA->Cityscapes dataset, representing a notable improvement of +1.03 percentage points over the previous state-of-the-art results.
https://arxiv.org/abs/2404.16301
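Transferring target-domain style to source features in latent space is commonly done by swapping channel-wise statistics (AdaIN-style). A sketch of that generic operation, not necessarily the paper's exact design:

```python
import numpy as np

def transfer_style(source_feat, target_feat, eps=1e-6):
    """Re-normalize source features to carry the target's channel statistics.

    source_feat, target_feat: (C, H, W) feature maps. The channel-wise
    mean/std of the source are replaced by the target's, a common way to
    transfer style in latent space; this is a generic sketch of the idea.
    """
    s = source_feat.reshape(source_feat.shape[0], -1)
    t = target_feat.reshape(target_feat.shape[0], -1)
    s_mu, s_std = s.mean(1, keepdims=True), s.std(1, keepdims=True) + eps
    t_mu, t_std = t.mean(1, keepdims=True), t.std(1, keepdims=True) + eps
    out = (s - s_mu) / s_std * t_std + t_mu
    return out.reshape(source_feat.shape)
```

Training on such restyled source features encourages the decision process to rely on content rather than source-domain appearance.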
When neural networks are confronted with unfamiliar data that deviate from their training set, this signifies a domain shift. While these networks output predictions on their inputs, they typically fail to account for their level of familiarity with these novel observations. This challenge becomes even more pronounced in resource-constrained settings, such as embedded systems or edge devices. To address such challenges, we aim to recalibrate a neural network's decision boundaries in relation to its cognizance of the data it observes, introducing an approach we coin certainty distillation. While prevailing works navigate unsupervised domain adaptation (UDA) with the goal of curtailing model entropy, they unintentionally produce models that grapple with calibration inaccuracies, a dilemma we term the over-certainty phenomenon. In this paper, we probe the drawbacks of this traditional learning model. As a solution, we propose a UDA algorithm that not only augments accuracy but also assures model calibration, all while maintaining suitability for environments with limited computational resources.
https://arxiv.org/abs/2404.16168
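The calibration inaccuracies at issue above are usually quantified with the Expected Calibration Error (ECE). A minimal sketch of the standard metric:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: gap between confidence and accuracy.

    confidences: (N,) predicted max-probabilities in (0, 1].
    correct: (N,) booleans/ints, prediction == label.
    Bins predictions by confidence and averages |accuracy - confidence|
    weighted by bin size. Standard metric; minimal sketch.
    """
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

An over-certain model in the paper's sense is one whose confidences systematically exceed its bin-wise accuracy, which inflates this quantity.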
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
https://arxiv.org/abs/2404.16014
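The gating idea, a binary gate deciding which features fire (and carrying the L1 penalty) alongside a separate magnitude estimate, can be sketched as a forward pass. Shapes and the weight-sharing scheme here are simplified assumptions based on the abstract:

```python
import numpy as np

def gated_sae_forward(x, W_enc, b_gate, b_mag, W_dec, b_dec, r_mag):
    """Forward pass of a (simplified) Gated Sparse Autoencoder.

    Two encoder paths share W_enc: a gate path decides WHICH features fire
    (Heaviside of its pre-activations, which also receive the L1 sparsity
    penalty), and a magnitude path (rescaled by exp(r_mag)) estimates HOW
    STRONGLY. Simplified from the paper; details of the tying may differ.
    """
    pre = x @ W_enc                        # shared encoder pre-activations
    pi_gate = pre + b_gate                 # gate-path pre-activations
    gate = (pi_gate > 0).astype(x.dtype)   # binary: which features are on
    mag = np.maximum(pre * np.exp(r_mag) + b_mag, 0.0)  # feature magnitudes
    f = gate * mag                         # gated feature activations
    x_hat = f @ W_dec + b_dec              # linear reconstruction
    l1_penalty = np.maximum(pi_gate, 0.0).sum()  # L1 applied to gate path only
    return x_hat, f, l1_penalty
```

Keeping the L1 penalty off the magnitude path is what removes the shrinkage bias: magnitudes are free to match the true activation strength once the gate has decided a feature is active.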
Unsupervised constrained text generation aims to generate text under a given set of constraints without any supervised data. Current state-of-the-art methods stochastically sample edit positions and actions, which may cause unnecessary search steps. In this paper, we propose PMCTG to improve effectiveness by searching for the best edit position and action in each step. Specifically, PMCTG extends perturbed masking technique to effectively search for the most incongruent token to edit. Then it introduces four multi-aspect scoring functions to select edit action to further reduce search difficulty. Since PMCTG does not require supervised data, it could be applied to different generation tasks. We show that under the unsupervised setting, PMCTG achieves new state-of-the-art results in two representative tasks, namely keywords-to-sentence generation and paraphrasing.
https://arxiv.org/abs/2404.15877
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. The most recent UDA methods typically resort to adversarial training to yield state-of-the-art results, and a dominant number of existing UDA methods employ convolutional neural networks (CNNs) as feature extractors to learn domain-invariant features. The Vision Transformer (ViT) has attracted tremendous attention since its emergence and has been widely used in various computer vision tasks, such as image classification, object detection, and semantic segmentation, yet its potential in adversarial domain adaptation has never been investigated. In this paper, we fill this gap by employing the ViT as the feature extractor in adversarial domain adaptation. Moreover, we empirically demonstrate that the ViT can be a plug-and-play component in adversarial domain adaptation: directly replacing the CNN-based feature extractor in existing UDA methods with a ViT-based feature extractor easily yields performance improvements. The code is available at this https URL.
https://arxiv.org/abs/2404.15817
Weakly-supervised diffusion models (DMs) for anomaly segmentation, which leverage image-level labels, have attracted significant attention for their superior performance compared to unsupervised methods. This setting eliminates the need for pixel-level labels in training, offering a more cost-effective alternative to supervised methods. However, existing methods are not fully weakly-supervised, because they heavily rely on costly pixel-level labels for hyperparameter tuning at inference time. To tackle this challenge, we introduce Anomaly Segmentation with the Forward Process of Diffusion Models (AnoFPDM), a fully weakly-supervised framework that operates without the need for pixel-level labels. Leveraging the unguided forward process as a reference, we identify suitable hyperparameters, i.e., noise scale and threshold, for each input image. We aggregate anomaly maps from each step of the forward process, enhancing the signal strength of anomalous regions. Remarkably, our proposed method outperforms recent state-of-the-art weakly-supervised approaches, even without utilizing pixel-level labels.
https://arxiv.org/abs/2404.15683
The advancement of the Laser Interferometer Gravitational-Wave Observatory (LIGO) has significantly enhanced the feasibility and reliability of gravitational wave detection. However, LIGO's high sensitivity makes it susceptible to transient noises known as glitches, which necessitate effective differentiation from real gravitational wave signals. Traditional approaches predominantly employ fully supervised or semi-supervised algorithms for the task of glitch classification and clustering. In the future task of identifying and classifying glitches across main and auxiliary channels, it is impractical to build a dataset with manually labeled ground-truth. In addition, the patterns of glitches can vary with time, generating new glitches without manual labels. In response to this challenge, we introduce the Cross-Temporal Spectrogram Autoencoder (CTSAE), a pioneering unsupervised method for the dimensionality reduction and clustering of gravitational wave glitches. CTSAE integrates a novel four-branch autoencoder with a hybrid of Convolutional Neural Networks (CNN) and Vision Transformers (ViT). To further extract features across multi-branches, we introduce a novel multi-branch fusion method using the CLS (Class) token. Our model, trained and evaluated on the GravitySpy O3 dataset on the main channel, demonstrates superior performance in clustering tasks when compared to state-of-the-art semi-supervised learning methods. To the best of our knowledge, CTSAE represents the first unsupervised approach tailored specifically for clustering LIGO data, marking a significant step forward in the field of gravitational wave research. The code of this paper is available at this https URL.
https://arxiv.org/abs/2404.15552
Unsupervised clustering of wafer map defect patterns is challenging because the appearance of certain defect patterns varies significantly. This includes changing shape, location, density, and rotation of the defect area on the wafer. We present a harvesting approach, which can cluster even challenging defect patterns of wafer maps well. Our approach makes use of a well-known, three-step procedure: feature extraction, dimension reduction, and clustering. The novelty in our approach lies in repeating dimensionality reduction and clustering iteratively while filtering out one cluster per iteration according to its silhouette score. This method leads to an improvement of clustering performance in general and is especially useful for difficult defect patterns. The low computational effort allows for a quick assessment of large datasets and can be used to support manual labeling efforts. We benchmark against related approaches from the literature and show improved results on a real-world industrial dataset.
https://arxiv.org/abs/2404.15436
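The iterative harvesting loop (cluster, filter out the cluster with the best silhouette score, repeat on the rest) can be sketched end-to-end. The k-means and silhouette helpers below are plain re-implementations for self-containment, and all parameters are illustrative; the paper's full pipeline also includes feature extraction and dimension reduction, which are omitted here:

```python
import numpy as np

def kmeans(X, k, n_iters=50, seed=0):
    """Plain Lloyd's k-means; a small helper for the sketch."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def silhouette_samples(X, labels):
    """Per-sample silhouette scores, O(N^2); fine for a sketch."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None]) ** 2).sum(-1))
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        other_clusters = [c for c in set(labels) if c != labels[i]]
        if not other_clusters or same.sum() <= 1:
            continue  # silhouette undefined; leave at 0
        a = D[i, same & (np.arange(n) != i)].mean()
        b = min(D[i, labels == c].mean() for c in other_clusters)
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s

def harvest(X, k=2, min_points=3):
    """Iteratively filter out the cluster with the highest mean silhouette.

    Returns a list of index arrays partitioning range(len(X)). Illustrative
    sketch of the iterative filtering idea, not the paper's full pipeline.
    """
    remaining = np.arange(len(X))
    harvested = []
    while len(remaining) > max(min_points, k):
        labels = kmeans(X[remaining], k)
        scores = silhouette_samples(X[remaining], labels)
        best = max(set(labels), key=lambda c: scores[labels == c].mean())
        harvested.append(remaining[labels == best])
        remaining = remaining[labels != best]
    harvested.append(remaining)
    return harvested
```

Removing the cleanest cluster each round keeps later clustering steps from being dominated by patterns that were already easy to separate.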
When deploying pre-trained video object detectors in real-world scenarios, the domain gap between training and testing data caused by adverse image conditions often leads to performance degradation. Addressing this issue becomes particularly challenging when only the pre-trained model and degraded videos are available. Although various source-free domain adaptation (SFDA) methods have been proposed for single-frame object detectors, SFDA for video object detection (VOD) remains unexplored. Moreover, most unsupervised domain adaptation works for object detection rely on two-stage detectors, while SFDA for one-stage detectors, which are more vulnerable to fine-tuning, is not well addressed in the literature. In this paper, we propose Spatial-Temporal Alternate Refinement with Mean Teacher (STAR-MT), a simple yet effective SFDA method for VOD. Specifically, we aim to improve the performance of the one-stage VOD method, YOLOV, under adverse image conditions, including noise, air turbulence, and haze. Extensive experiments on the ImageNetVOD dataset and its degraded versions demonstrate that our method consistently improves video object detection performance in challenging imaging conditions, showcasing its potential for real-world applications.
https://arxiv.org/abs/2404.15252
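The Mean Teacher component named above maintains a teacher model whose weights are an exponential moving average (EMA) of the student's. That standard update, with parameters represented as plain arrays; the rest of STAR-MT (the alternating spatial/temporal refinement) is not sketched here:

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.999):
    """Mean-teacher update: teacher weights track an EMA of the student's.

    Parameters are dicts mapping names to numpy arrays. The teacher is
    updated in place and returned; this is the standard mean-teacher rule,
    not anything specific to STAR-MT.
    """
    for name, w in student_params.items():
        teacher_params[name] = momentum * teacher_params[name] + (1.0 - momentum) * w
    return teacher_params
```

The slowly moving teacher supplies more stable pseudo-targets than the student itself, which is what makes it useful for source-free adaptation.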
Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs: e.g. a dialogue state and the system actions taken at each step. These annotations can be costly to produce, error-prone, and require both domain and annotation expertise. With advances in LLMs, we hypothesize unlabelled data and a schema definition are sufficient for building a working task-oriented dialogue system, completely unsupervised. Using only (1) a well-defined API schema (2) a set of unlabelled dialogues between a user and agent, we develop a novel approach for inferring turn-level annotations as latent variables using a noisy channel model. We iteratively improve these pseudo-labels with expectation-maximization (EM), and use the inferred labels to train an end-to-end dialogue agent. Evaluating our approach on the MultiWOZ benchmark, our method more than doubles the dialogue success rate of a strong GPT-3.5 baseline.
https://arxiv.org/abs/2404.15219
This paper focuses on self-supervised monocular depth estimation in dynamic scenes trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation, resulting in inaccurate depth estimation. This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data. The key contribution of our framework is to decouple depth estimation for static and dynamic regions of images in the training data. We start with an unsupervised depth estimation approach, which provides reliable depth estimates for static regions and motion cues for dynamic regions and allows us to extract moving object information at the instance level. In the next stage, we use an object network to estimate the depth of those moving objects assuming rigid motions. Then, we propose a new scale alignment module to address the scale ambiguity between estimated depths for static and dynamic regions. We can then use the depth labels generated to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self/unsupervised depth estimation methods.
https://arxiv.org/abs/2404.14908
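Scale alignment between independently estimated depth maps is often done by rescaling one map by the median ratio over shared pixels. This sketch shows that generic form of the idea, not the paper's learned module:

```python
import numpy as np

def align_scale(dynamic_depth, static_depth, overlap_mask):
    """Align a dynamic-region depth map to the static depth's scale.

    Monocular depth is predicted up to an unknown scale; a common fix is to
    rescale one prediction by the median ratio of the two over shared pixels.
    Generic sketch of a scale-alignment step, not the paper's module.
    """
    ratio = np.median(static_depth[overlap_mask] / dynamic_depth[overlap_mask])
    return dynamic_depth * ratio, ratio
```

The median makes the estimate robust to the occasional wildly wrong depth pixel inside the overlap region.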
Blocking is a critical step in entity resolution, and the emergence of neural network-based representation models has led to the development of dense blocking as a promising approach for exploring deep semantics in blocking. However, previous advanced self-supervised dense blocking approaches require domain-specific training on the target domain, which limits the benefits and rapid adaptation of these methods. To address this issue, we propose UBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable tabular corpus using self-supervised contrastive learning. By conducting domain-independent pre-training, UBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning. To evaluate the universality of our entity blocker, we also construct a new benchmark covering a wide range of blocking tasks from multiple domains and scenarios. Our experiments show that the proposed UBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods and is comparable and complementary to the state-of-the-art sparse blocking methods.
https://arxiv.org/abs/2404.14831
Recently, unsupervised salient object detection (USOD) has gained increasing attention due to its annotation-free nature. However, current methods mainly focus on specific tasks such as RGB and RGB-D, neglecting the potential for task migration. In this paper, we propose a unified USOD framework for generic USOD tasks. Firstly, we propose a Progressive Curriculum Learning-based Saliency Distilling (PCL-SD) mechanism to extract saliency cues from a pre-trained deep network. This mechanism starts with easy samples and progressively moves towards harder ones, to avoid initial interference caused by hard samples. Afterwards, the obtained saliency cues are utilized to train a saliency detector, and we employ a Self-rectify Pseudo-label Refinement (SPR) mechanism to improve the quality of pseudo-labels. Finally, an adapter-tuning method is devised to transfer the acquired saliency knowledge, leveraging shared knowledge to attain superior transferring performance on the target tasks. Extensive experiments on five representative SOD tasks confirm the effectiveness and feasibility of our proposed method. Code and supplement materials are available at this https URL.
https://arxiv.org/abs/2404.14759