Machine unlearning has emerged as a prominent and challenging area of interest, driven in large part by rising regulatory demands for industries to delete user data upon request and by heightened privacy awareness. Existing approaches either retrain models from scratch or use several finetuning steps for every deletion request, and are often constrained by computational resource limitations and restricted access to the original training data. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate an entire class or a group of classes from the learned model. To that end, our algorithm first estimates the Retain Space and the Forget Space, representing the feature or activation spaces for samples from classes to be retained and unlearned, respectively. To obtain these spaces, we propose a novel singular value decomposition-based technique that requires layer-wise collection of network activations from a few forward passes through the network. We then compute the shared information between these spaces and remove it from the Forget Space to isolate the class-discriminatory feature space for unlearning. Finally, we project the model weights in the direction orthogonal to the class-discriminatory space to obtain the unlearned model. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer, with only a $\sim$1.5% drop in retain accuracy compared to the original model while keeping accuracy on the unlearned class samples under 1%. Further, our algorithm consistently performs well under Membership Inference Attacks, showing a 7.8% improvement on average across a variety of image classification datasets and network architectures compared to other baselines, while being $\sim$6x more computationally efficient.
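As a rough illustration of the projection step, the sketch below (PyTorch; the function names and the fixed rank k are our assumptions, not the paper's exact procedure) estimates the two spaces by SVD, removes shared directions from the Forget Space, and projects a layer's weights away from the residual class-discriminatory basis:

```python
import torch

def class_discriminatory_space(retain_acts, forget_acts, k=32):
    """A minimal sketch of the SVD step described above for one layer.

    retain_acts, forget_acts: (num_samples, dim) activations collected
    from a few forward passes for retain / forget class samples."""
    # Estimate Retain and Forget Spaces from the leading singular vectors.
    Ur = torch.linalg.svd(retain_acts.T, full_matrices=False)[0][:, :k]
    Uf = torch.linalg.svd(forget_acts.T, full_matrices=False)[0][:, :k]

    # Remove information shared with the Retain Space from the Forget
    # Space, then re-orthonormalize the residual directions.
    residual = Uf - Ur @ (Ur.T @ Uf)
    Q, _ = torch.linalg.qr(residual)
    return Q                      # basis of the class-discriminatory space

def unlearn_weights(weight, Q):
    # Project weights onto the orthogonal complement: W <- W (I - Q Q^T).
    eye = torch.eye(weight.shape[1], device=weight.device)
    return weight @ (eye - Q @ Q.T)
```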
https://arxiv.org/abs/2312.00761
Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most existing CLIP-like works adopt relatively large image encoders like ResNet50 and ViT, while their lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Secondly, a relaxed bipartite-matching-based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of the CLIP model does not increase correspondingly as the parameters of the text encoder increase, an extra masked language modeling (MLM) objective is leveraged to maximize the potential of the shortened text encoder. In practice, an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings at different network stages is proposed to enhance the MLM. Extensive experiments show that, without introducing additional computational cost during inference, the proposed method achieves higher performance on multiple downstream tasks.
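A minimal sketch of the first idea, progressive softening of negative labels in the instance-level alignment objective; the schedule direction, smoothing magnitude, and function names here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(image_emb, text_emb, epoch, max_epochs, tau=0.07):
    """Soft-target contrastive loss: negatives receive a small,
    progressively softened probability mass instead of hard zeros.
    The linear schedule and 0.1 cap are assumptions, not the paper's."""
    logits = image_emb @ text_emb.T / tau            # (B, B) similarities
    n = logits.size(0)
    smooth = 0.1 * (epoch / max_epochs)              # grows over training
    targets = torch.full_like(logits, smooth / (n - 1))
    targets.fill_diagonal_(1.0 - smooth)             # matched pairs on diag
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```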
https://arxiv.org/abs/2312.00674
Diffusion models have gained prominence in generating data for perception tasks such as image classification and object detection. However, their potential for generating high-quality tracking sequences, a crucial aspect of video perception, has not been fully investigated. To address this gap, we propose TrackDiffusion, a novel architecture designed to generate continuous video sequences from tracklets. TrackDiffusion represents a significant departure from traditional layout-to-image (L2I) generation and copy-paste synthesis, which focus on static image elements like bounding boxes, by empowering image diffusion models to encompass dynamic and continuous tracking trajectories, thereby capturing complex motion nuances and ensuring instance consistency among video frames. For the first time, we demonstrate that the generated video sequences can be utilized for training multi-object tracking (MOT) systems, leading to significant improvement in tracker performance. Experimental results show that our model significantly enhances instance consistency in generated video sequences, leading to improved perceptual metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the standards of video data generation for MOT tasks and beyond.
https://arxiv.org/abs/2312.00651
Normalization techniques have been widely used in deep learning due to their ability to enable higher learning rates and reduce sensitivity to initialization. However, the effectiveness of popular normalization techniques is typically limited to specific settings. Unlike standard Batch Normalization (BN) and Layer Normalization (LN), where BN computes the mean and variance along the (N, H, W) dimensions and LN computes them along the (C, H, W) dimensions (N, C, H, and W are the batch, channel, spatial height, and width dimensions, respectively), this paper presents a novel normalization technique called Batch Channel Normalization (BCN). To exploit both channel and batch dependence, and to adaptively combine the advantages of BN and LN for specific datasets or tasks, BCN separately normalizes inputs along the (N, H, W) and (C, H, W) axes, then combines the normalized outputs based on adaptive parameters. As a basic block, BCN can be easily integrated into existing models for various applications in computer vision. Empirical results show that the proposed technique can be seamlessly applied to various versions of CNN and Vision Transformer architectures. The code is publicly available at this https URL
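A minimal PyTorch sketch of a BCN-style layer under these definitions; the per-channel gating used to combine the two normalized outputs is our assumption about the "adaptive parameters":

```python
import torch
import torch.nn as nn

class BatchChannelNorm(nn.Module):
    """Sketch: normalize along (N, H, W) like BN and along (C, H, W)
    like LN, then mix the two with a learnable per-channel gate."""

    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, eps=eps, affine=False)
        self.eps = eps
        self.gate = nn.Parameter(torch.full((num_channels,), 0.5))
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x):                              # x: (N, C, H, W)
        x_bn = self.bn(x)                              # stats over (N, H, W)
        mu = x.mean(dim=(1, 2, 3), keepdim=True)       # stats over (C, H, W)
        var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_ln = (x - mu) / torch.sqrt(var + self.eps)
        g = self.gate.view(1, -1, 1, 1)
        y = g * x_bn + (1 - g) * x_ln                  # adaptive combination
        return y * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```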
https://arxiv.org/abs/2312.00596
Vision Transformers have received significant attention due to their impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, the channel mixer or feature mixing block (FFN or MLP) has not been explored in depth, even though it accounts for the bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace dense connections and confirm this with a block-diagonal MLP structure that improves accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure and thereby further improve accuracy, a lightweight, parameter-free channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design enables gradual feature mixing across channel groups during training, with a contribution that decays to zero as training progresses to convergence. This allows the CCA block to be discarded during inference, thus enabling enhanced performance with no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block-diagonal structure size in the MLP. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially under lower-FLOPs regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
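A block-diagonal channel mixer of this kind can be sketched with grouped 1x1 convolutions, which realize a block-diagonal weight matrix; the group count and expansion ratio below are illustrative choices, not the paper's:

```python
import torch
import torch.nn as nn

class BlockDiagonalMLP(nn.Module):
    """Sketch: channels are split into g groups, each mixed by its own
    dense sub-MLP, cutting FFN parameters by ~g and thereby allowing a
    larger expansion ratio at the same cost."""

    def __init__(self, dim, expansion=8, groups=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv1d(dim, hidden, kernel_size=1, groups=groups)
        self.act = nn.GELU()
        self.fc2 = nn.Conv1d(hidden, dim, kernel_size=1, groups=groups)

    def forward(self, x):             # x: (B, num_tokens, dim)
        x = x.transpose(1, 2)         # -> (B, dim, num_tokens)
        x = self.fc2(self.act(self.fc1(x)))
        return x.transpose(1, 2)
```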
https://arxiv.org/abs/2312.00412
Active learning aims to enhance model performance by strategically labeling informative data points. While extensively studied, its effectiveness on large-scale, real-world datasets remains underexplored. Existing research primarily focuses on single-source data, ignoring the multi-domain nature of real-world data. We introduce a multi-domain active learning benchmark to bridge this gap. Our benchmark demonstrates that traditional single-domain active learning strategies are often less effective than random selection in multi-domain scenarios. We also introduce CLIP-GeoYFCC, a novel large-scale image dataset built around geographical domains, in contrast to existing genre-based domain datasets. Analysis on our benchmark shows that all multi-domain strategies exhibit significant tradeoffs, with no strategy outperforming the others across all datasets or all metrics, emphasizing the need for future research.
https://arxiv.org/abs/2312.00364
Stein's paradox holds considerable sway in high-dimensional statistics, highlighting that the sample mean, traditionally considered the de facto estimator, might not be the most efficacious in higher dimensions. To address this, the James-Stein estimator proposes an enhancement by steering the sample means toward a more centralized mean vector. In this paper, first, we establish that normalization layers in deep learning use inadmissible estimators for mean and variance. Next, we introduce a novel method to employ the James-Stein estimator to improve the estimation of mean and variance within normalization layers. We evaluate our method on different computer vision tasks: image classification, semantic segmentation, and 3D object classification. Through these evaluations, it is evident that our improved normalization layers consistently yield superior accuracy across all tasks without extra computational burden. Moreover, recognizing that a plethora of shrinkage estimators surpass the traditional estimator in performance, we study two other prominent shrinkage estimators: Ridge and LASSO. Additionally, we provide visual representations to intuitively demonstrate the impact of shrinkage on the estimated layer statistics. Finally, we study the effect of regularization and batch size on our modified batch normalization. The studies show that our method is less sensitive to batch size and regularization, improving accuracy under various setups.
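For intuition, a sketch of positive-part James-Stein shrinkage applied to per-channel sample means, as might be done inside a normalization layer; treating the noise variance as given is a simplifying assumption of this sketch:

```python
import torch

def james_stein_shrink(means, noise_var):
    """Positive-part James-Stein shrinkage of C per-channel sample means
    toward their grand mean (requires C > 3). noise_var is the variance
    of each sample mean (roughly sigma^2 / batch size)."""
    p = means.numel()
    grand = means.mean()
    dev = means - grand
    shrink = 1.0 - (p - 3) * noise_var / dev.pow(2).sum().clamp_min(1e-12)
    return grand + shrink.clamp_min(0.0) * dev
```

Inside a modified batch normalization layer, the per-channel batch means (and, analogously, variances) would be replaced by their shrunk counterparts before normalizing.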
https://arxiv.org/abs/2312.00313
Against the backdrop of the ever-increasing data requirements of deep neural networks for object recognition, we present Developmental PreTraining (DPT) as a possible solution. DPT is a curriculum-based pre-training approach designed to rival traditional, data-hungry pre-training techniques. Those techniques can also introduce unnecessary features that mislead the network in a downstream classification task where the data differs substantially from the pre-training data and is scarce. We design the curriculum for DPT by drawing inspiration from human infant visual development. DPT employs a phased approach in which carefully selected primitive and universal features, such as edges and shapes, are taught to the network participating in our pre-training regime. A model that underwent the DPT regime is tested against models with randomised weights to evaluate the viability of DPT.
https://arxiv.org/abs/2312.00304
In this paper, we introduce an explainable algorithm, built on a multi-modal foundation model, that performs fast and explainable image classification. Drawing inspiration from CLIP-based Concept Bottleneck Models (CBMs), our method creates a latent space where each neuron is linked to a specific word. Observing that this latent space can be modeled with simple distributions, we use a Mixture of Gaussians (MoG) formalism to enhance its interpretability. We then introduce CLIP-QDA, a classifier that uses only statistical values to infer labels from the concepts. In addition, this formalism allows for both local and global explanations. Because these explanations come from the inner design of our architecture, our work is part of a new family of greybox models, combining the performance of opaque foundation models with the interpretability of transparent models. Our empirical findings show that in instances where the MoG assumption holds, CLIP-QDA achieves accuracy similar to that of state-of-the-art CBMs. Our explanations compete with existing XAI methods while being faster to compute.
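A toy sketch of QDA-style inference over concept activations, assuming one Gaussian per class with diagonal covariance for simplicity (the paper's exact MoG parameterization may differ):

```python
import numpy as np
from scipy.stats import multivariate_normal

class ConceptQDA:
    """Fit one Gaussian per class in the concept space, then infer
    labels from class posteriors; a simplified illustration only."""

    def fit(self, Z, y):                       # Z: (n, d) concept activations
        self.classes = np.unique(y)
        self.stats = []
        for c in self.classes:
            Zc = Z[y == c]
            # Per-class mean, diagonal variance, and class prior.
            self.stats.append((Zc.mean(0), Zc.var(0) + 1e-6, len(Zc) / len(Z)))
        return self

    def predict(self, Z):
        log_post = np.stack([
            multivariate_normal.logpdf(Z, mean=m, cov=np.diag(v)) + np.log(p)
            for m, v, p in self.stats], axis=1)
        return self.classes[np.argmax(log_post, axis=1)]
```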
https://arxiv.org/abs/2312.00110
Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple fine-grained concepts in a sequential (i.e., continual) manner while providing only a few example images for each concept. This setting is known as continual diffusion. Here, we ask the question: Can we scale these methods to longer concept sequences without forgetting? Although prior work mitigates the forgetting of previously learned concepts, we show that its capacity to learn new tasks reaches saturation over longer sequences. We address this challenge by introducing a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-rank attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low-rank MLPs, enabling precise, scalable learning via sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, inducing no additional inference parameter costs. We show that STAMINA outperforms the prior SOTA for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extend our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance in this standard benchmark.
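To illustrate the flavor of a low-rank, attention-masked adapter, here is a hypothetical sketch combining a LoRA update with a straight-through hard mask produced by a low-rank MLP; this sketch does not show the parameter folding the paper describes, and all parameterization choices are our assumptions:

```python
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """Sketch: a LoRA update gated elementwise by a hard(ish) mask
    generated from a low-rank MLP, trained with a straight-through
    estimator so the hard threshold stays differentiable."""

    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base                      # frozen pretrained layer
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.mask_mlp = nn.Sequential(        # low-rank mask generator
            nn.Linear(base.in_features, rank), nn.ReLU(),
            nn.Linear(rank, base.in_features))

    def forward(self, x):
        soft = torch.sigmoid(self.mask_mlp(x))
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()    # straight-through gradient
        return self.base(x) + (mask * x) @ self.A.T @ self.B.T
```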
https://arxiv.org/abs/2311.18763
Prototypical-part interpretable methods, e.g., ProtoPNet, enhance interpretability by connecting classification predictions to class-specific training prototypes, thereby offering intuitive insight into their decision-making. Current methods rely on a discriminative classifier trained with point-based learning techniques that provide specific values for prototypes. Such prototypes have relatively low representation power due to their sparsity and potential redundancy, with each prototype containing no variability measure. In this paper, we present a new generative learning of prototype distributions, named Mixture of Gaussian-distributed Prototypes (MGProto), in which prototypes are represented by Gaussian mixture models (GMMs). This approach enables the learning of more powerful prototype representations, since each learned prototype owns a measure of variability, which naturally reduces sparsity given the spread of the distribution around each prototype; we also integrate a prototype diversity objective into the GMM optimisation to reduce redundancy. Incidentally, the generative nature of MGProto offers a new and effective way of detecting out-of-distribution samples. To improve the compactness of MGProto, we further propose to prune Gaussian-distributed prototypes with a low prior. Experiments on the CUB-200-2011, Stanford Cars, Stanford Dogs, and Oxford-IIIT Pets datasets show that MGProto achieves state-of-the-art classification and OoD detection performance with encouraging interpretability results.
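A rough sketch of GMM-based prototypes using scikit-learn: one mixture per class, classification by class likelihood, and pruning of low-prior components; the in-place pruning and all hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(features_by_class, n_prototypes=5):
    # One GMM of Gaussian-distributed prototypes per class.
    return {c: GaussianMixture(n_components=n_prototypes,
                               covariance_type="diag").fit(X)
            for c, X in features_by_class.items()}

def classify(gmms, X):
    # Predict via the highest class-conditional log-likelihood.
    classes = sorted(gmms)
    scores = np.stack([gmms[c].score_samples(X) for c in classes], axis=1)
    return np.asarray(classes)[np.argmax(scores, axis=1)]

def prune_low_prior(gmm, min_prior=0.05):
    # Drop prototypes whose mixture weight (prior) is below a threshold;
    # in-place surgery on sklearn internals is a simplification.
    keep = gmm.weights_ >= min_prior
    gmm.means_ = gmm.means_[keep]
    gmm.covariances_ = gmm.covariances_[keep]
    gmm.precisions_cholesky_ = gmm.precisions_cholesky_[keep]
    gmm.weights_ = gmm.weights_[keep] / gmm.weights_[keep].sum()
    gmm.n_components = int(keep.sum())
    return gmm
```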
https://arxiv.org/abs/2312.00092
Graph neural networks (GNNs) present a promising alternative to CNNs and transformers in certain image processing applications due to their parameter efficiency in modeling spatial relationships. Currently, a major area of research involves converting non-graph input data for GNN-based models, notably in scenarios where the data originates from images. One approach converts images into nodes by identifying significant keypoints within them. Super-Retina, a semi-supervised technique, has been utilized for detecting keypoints in retinal images. However, its limitation lies in its dependency on a small initial set of ground-truth keypoints, which is progressively expanded to detect more keypoints. Having encountered difficulties in detecting consistent initial keypoints in brain images using SIFT and LoFTR, we propose a new approach: radiomic feature-based keypoint detection. We demonstrate the anatomical significance of the detected keypoints by showcasing their efficacy in improving registration processes guided by these keypoints. Subsequently, these keypoints are employed as the ground truth for the keypoint detection method (LK-SuperRetina). Furthermore, the study showcases the application of GNNs in image matching, highlighting their superior performance in terms of both the number of good matches and confidence scores. This research sets the stage for expanding GNN applications into various other tasks, including but not limited to image classification, segmentation, and registration.
https://arxiv.org/abs/2311.18281
Continual test-time adaptation (cTTA) methods are designed to facilitate the continual adaptation of models to dynamically changing real-world environments where computational resources are limited. Due to this inherent limitation, existing approaches fail to simultaneously achieve accuracy and efficiency. In detail, when adapting from a single image, the instability caused by batch normalization layers and the entropy loss significantly undermines many existing methods in real-world cTTA scenarios. To overcome these challenges, we present BESTTA, a novel single-image continual test-time adaptation method guided by style transfer, which enables stable and efficient adaptation to the target environment by transferring the style of the input image to the source style. To implement the proposed method, we devise BeIN, a simple yet powerful normalization method, along with style-guided losses. We demonstrate that BESTTA effectively adapts to the continually changing target environment using only a single image, on both semantic segmentation and image classification tasks. Remarkably, despite training only two parameters in a BeIN layer and consuming the least memory, BESTTA outperforms existing state-of-the-art methods in terms of performance.
https://arxiv.org/abs/2311.18270
In many practical computer vision scenarios unlabeled data is plentiful, but labels are scarce and difficult to obtain. As a result, semi-supervised learning, which leverages unlabeled data to boost the performance of supervised classifiers, has received significant attention in recent literature. One major class of semi-supervised algorithms is co-training. In co-training, two different models leverage different independent and sufficient "views" of the data to jointly make better predictions. During co-training, each model creates pseudo labels on unlabeled points which are used to improve the other model. We show that in the common case when independent views are not available, we can construct such views inexpensively using pre-trained models. Co-training on the constructed views yields a performance improvement over any of the individual views we construct and performance comparable with recent approaches in semi-supervised learning, but it has some undesirable properties. To alleviate the issues present with co-training, we present Meta Co-Training, an extension of the successful Meta Pseudo Labels approach to multiple views. Our method achieves new state-of-the-art performance on ImageNet-10% with very few training resources, as well as outperforming prior semi-supervised work on several other fine-grained image classification datasets.
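A minimal co-training loop over two constructed views might look like the following; the sklearn-style model interface and the confidence threshold are illustrative assumptions:

```python
import numpy as np

def co_train(model_a, model_b, Xa, Xb, y, Ua, Ub, rounds=5, thresh=0.9):
    """Sketch of co-training over two views of the same samples:
    Xa/Xb are labeled view features, Ua/Ub unlabeled view features.
    Models follow the sklearn fit/predict_proba API."""
    ya, yb = y.copy(), y.copy()
    for _ in range(rounds):
        model_a.fit(Xa, ya)
        model_b.fit(Xb, yb)
        pa, pb = model_a.predict_proba(Ua), model_b.predict_proba(Ub)
        keep_a, keep_b = pa.max(1) >= thresh, pb.max(1) >= thresh
        # Each model pseudo-labels confident points for the *other* model.
        Xb = np.concatenate([Xb, Ub[keep_a]])
        yb = np.concatenate([yb, pa.argmax(1)[keep_a]])
        Xa = np.concatenate([Xa, Ua[keep_b]])
        ya = np.concatenate([ya, pb.argmax(1)[keep_b]])
        used = keep_a | keep_b                 # drop consumed samples
        Ua, Ub = Ua[~used], Ub[~used]
    return model_a, model_b
```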
https://arxiv.org/abs/2311.18083
While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations. We propose a novel attention mechanism for pooling feature maps and further leverage this mechanism in DifFormer, a transformer-based fusion of features from different diffusion U-Net blocks and noise steps. We also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods on discriminative tasks: image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation. Our project website (this https URL) and code (this https URL) are available publicly.
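A sketch of attention pooling over a single U-Net feature map; a DifFormer-style fusion would combine such pooled descriptors across blocks and noise steps (the learned-query design here is our assumption):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Sketch: a learned query attends over flattened spatial positions
    of a feature map to produce one descriptor per image."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, fmap):                      # fmap: (B, C, H, W)
        tokens = fmap.flatten(2).transpose(1, 2)  # -> (B, H*W, C)
        q = self.query.expand(fmap.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled.squeeze(1)                  # (B, C)
```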
https://arxiv.org/abs/2311.17921
Deep neural networks, while powerful for image classification, often operate as "black boxes," complicating the understanding of their decision-making processes. Various explanation methods, particularly those generating saliency maps, aim to address this challenge. However, the inconsistency of faithfulness metrics hinders reliable benchmarking of explanation methods. This paper employs an approach inspired by psychometrics, utilizing Krippendorff's alpha to quantify the benchmark reliability of post-hoc methods in image classification. The study proposes model training modifications, including feeding perturbed samples and employing focal loss, to enhance robustness and calibration. Empirical evaluations demonstrate significant improvements in benchmark reliability across metrics, datasets, and post-hoc methods. This pioneering work establishes a foundation for more reliable evaluation practices in the realm of post-hoc explanation methods, emphasizing the importance of model robustness in the assessment process.
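For reference, reliability in this style can be computed with the `krippendorff` Python package, treating faithfulness metrics as raters and explanation methods as units; the toy rankings below are purely illustrative:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows: "raters" (faithfulness metrics); columns: "units" (post-hoc
# explanation methods); entries: the rank each metric assigns.
rankings = np.array([
    [1, 2, 3, 4],   # metric 1's ranking of four explanation methods
    [1, 3, 2, 4],   # metric 2
    [2, 1, 3, 4],   # metric 3
])
alpha = krippendorff.alpha(reliability_data=rankings,
                           level_of_measurement="ordinal")
print(f"benchmark reliability (Krippendorff's alpha): {alpha:.3f}")
```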
https://arxiv.org/abs/2311.17876
While deep learning has led to huge progress in complex image classification tasks like ImageNet, unexpected failure modes, e.g. via spurious features, call into question how reliably these classifiers work in the wild. Furthermore, for safety-critical tasks the black-box nature of their decisions is problematic, and explanations, or at least methods which make decisions plausible, are needed urgently. In this paper, we address these problems by generating images that optimize a classifier-derived objective using a framework for guided image generation. We analyze the behavior and decisions of image classifiers by visual counterfactual explanations (VCEs), detection of systematic mistakes by analyzing images where classifiers maximally disagree, and visualization of neurons to verify potential spurious features. In this way, we validate existing observations, e.g. the shape bias of adversarially robust models, as well as novel failure modes, e.g. systematic errors of zero-shot CLIP classifiers, and identify harmful spurious features. Moreover, our VCEs outperform previous work while being more versatile.
https://arxiv.org/abs/2311.17833
State-of-the-art models for pixel-wise prediction tasks such as image restoration, image segmentation, or disparity estimation involve several stages of data resampling, in which the resolution of feature maps is first reduced to aggregate information and then sequentially increased to generate a high-resolution output. Several previous works have investigated the artifacts that are invoked during downsampling, and diverse cures have been proposed that help improve prediction stability and even robustness for image classification. However, equally relevant artifacts that arise during upsampling have been discussed less. This matters because upsampling and downsampling approaches face fundamentally different challenges: while during downsampling aliases and artifacts can be reduced by blurring feature maps, the emergence of fine details is crucial during upsampling. Blurring is therefore not an option, and dedicated operations need to be considered. In this work, we are the first to explore the relevance of context during upsampling by employing convolutional upsampling operations with increasing kernel size while keeping the encoder unchanged. We find that increased kernel sizes can in general improve prediction stability in tasks such as image restoration or image segmentation, while a block that allows a combination of small kernels for fine details and large kernels for artifact removal and increased context yields the best results.
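A sketch of such a block: after upsampling, a small-kernel branch preserves fine details while a large-kernel branch contributes context; the specific kernel sizes and the additive combination are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MixedKernelUpBlock(nn.Module):
    """Sketch: upsample, then combine a small-kernel branch (fine
    detail) with a large-kernel branch (context / artifact removal)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.small = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.large = nn.Conv2d(in_ch, out_ch, kernel_size=11, padding=5)

    def forward(self, x):
        x = nn.functional.interpolate(x, scale_factor=2, mode="nearest")
        return self.small(x) + self.large(x)
```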
https://arxiv.org/abs/2311.17524
The accelerated advancement of generative AI significantly enhances the viability and effectiveness of generative regional editing methods. This evolution renders image manipulation more accessible, thereby intensifying the risk of altering the information conveyed within original images and even propagating misinformation. Consequently, there is a critical demand for robust methods capable of detecting edited images. However, the lack of a comprehensive dataset containing images edited with abundant and advanced generative regional editing methods poses a substantial obstacle to the advancement of corresponding detection methods. We endeavor to fill this vacancy by constructing the GRE dataset, a large-scale generative regional editing dataset with the following advantages: 1) collection of real-world original images, focusing on two frequently edited scenarios; 2) integration of a logical and simulated editing pipeline leveraging multiple large models across various modalities; 3) inclusion of various editing approaches with distinct architectures; 4) provision of comprehensive analysis tasks. We perform comprehensive experiments on three proposed tasks: edited image classification, edited method attribution, and edited region localization, providing analysis of distinct editing methods and evaluation of detection methods in related fields. We expect that the GRE dataset can promote further research and exploration in the field of generative region editing detection.
https://arxiv.org/abs/2311.17953
Modern deep neural networks have achieved great success in medical image analysis. However, the features captured by convolutional neural networks (CNNs) or Transformers tend to be optimized for pixel intensities and neglect key anatomical structures such as connected components and loops. In this paper, we propose a persistent homology guided approach (PHG-Net) that explores topological features of objects for medical image classification. For an input image, we first compute its cubical persistence diagram and extract topological features into a vector representation using a small neural network (called the PH module). The extracted topological features are then incorporated into the feature map generated by a CNN or Transformer for feature fusion. The PH module is lightweight and capable of integrating topological features into any CNN or Transformer architecture in an end-to-end fashion. We evaluate PHG-Net on three public datasets and demonstrate its considerable improvements over state-of-the-art methods on the target classification tasks.
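A sketch of a PH-module-style extractor using the gudhi library: compute the cubical persistence diagram, flatten the most persistent pairs into a fixed-length vector, and embed it with a small MLP; the vectorization scheme and all sizes are our assumptions:

```python
import numpy as np
import gudhi
import torch
import torch.nn as nn

def cubical_ph_vector(image, max_pairs=32):
    """Compute the cubical persistence diagram of a grayscale image and
    keep the max_pairs most persistent finite (birth, death) pairs as a
    fixed-length vector for a small MLP to embed."""
    cc = gudhi.CubicalComplex(top_dimensional_cells=image)
    pairs = [(b, d) for _, (b, d) in cc.persistence() if d != float("inf")]
    pairs.sort(key=lambda p: p[1] - p[0], reverse=True)  # by persistence
    vec = np.zeros(2 * max_pairs, dtype=np.float32)
    flat = np.asarray(pairs[:max_pairs], dtype=np.float32).reshape(-1)
    vec[:len(flat)] = flat
    return vec

# The PH module embeds the diagram vector; the resulting topological
# feature would then be fused with the CNN/Transformer feature map.
ph_module = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
topo_feat = ph_module(torch.from_numpy(cubical_ph_vector(np.random.rand(64, 64))))
```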
https://arxiv.org/abs/2311.17243