In this paper, we present a simple yet effective contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints. Unlike traditional knowledge distillation methods that concentrate on maximizing feature similarities or preserving class-wise semantic correlations between teacher and student features, our method attempts to recover the "dark knowledge" by aligning sample-wise teacher and student logits. Specifically, our method first minimizes logit differences within the same sample by considering their numerical values, thus preserving intra-sample similarities. Next, we bridge semantic disparities by leveraging dissimilarities across different samples. Note that constraints on intra-sample similarities and inter-sample dissimilarities can be efficiently and effectively reformulated into a contrastive learning framework with newly designed positive and negative pairs. The positive pair consists of the teacher's and student's logits derived from an identical sample, while the negative pairs are formed by using logits from different samples. With this formulation, our method benefits from the simplicity and efficiency of contrastive learning through the optimization of InfoNCE, yielding a run-time complexity that is far less than $O(n^2)$, where $n$ represents the total number of training samples. Furthermore, our method can eliminate the need for hyperparameter tuning, particularly related to temperature parameters and large batch sizes. We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO. Experimental results clearly confirm the effectiveness of the proposed method on both image classification and object detection tasks. Our source codes will be publicly available at this https URL.
https://arxiv.org/abs/2404.14109
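A minimal PyTorch sketch of the contrastive formulation described in the abstract above: teacher and student logits from the same sample form the positive pair, logits from different samples in the batch act as negatives, and the objective is InfoNCE. The cosine similarity on normalized logit vectors is an assumption for illustration; consistent with the abstract's claim of no temperature tuning, no temperature is shown.

```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_logits, teacher_logits):
    """InfoNCE-style distillation: the student logit vector of sample i is pulled
    toward the teacher logit vector of the same sample (positive pair) and pushed
    away from the teacher logits of other samples in the batch (negatives).

    student_logits, teacher_logits: tensors of shape (batch, num_classes).
    """
    s = F.normalize(student_logits, dim=1)           # illustrative choice: cosine
    t = F.normalize(teacher_logits, dim=1).detach()  # similarity on logit vectors
    sim = s @ t.T                                    # (batch, batch) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Diagonal entries are the positives (same sample); off-diagonal entries are
    # negatives, so one cross-entropy per row implements InfoNCE over the batch.
    return F.cross_entropy(sim, targets)

# Typical usage: total_loss = task_ce_loss + contrastive_kd_loss(student_logits, teacher_logits)
```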
Image classification is a fundamental task in computer vision, and the quest to enhance DNN accuracy without inflating model size or latency remains a pressing concern. We make a couple of advances in this regard, leading to a novel EncodeNet design and training framework. The first advancement involves Converting Autoencoders, a novel approach that transforms an image into an easy-to-classify image of its class. Our prior work that applied the Converting Autoencoder and a simple classifier in tandem achieved moderate accuracy over simple datasets, such as MNIST and FMNIST. However, on more complex datasets like CIFAR-10, the Converting Autoencoder has a large reconstruction loss, making it unsuitable for enhancing DNN accuracy. To address these limitations, we generalize the design of Converting Autoencoders by leveraging a larger class of DNNs, those with architectures comprising feature extraction layers followed by classification layers. We incorporate a generalized algorithmic design of the Converting Autoencoder and intraclass clustering to identify representative images, leading to optimized image feature learning. Next, we demonstrate the effectiveness of our EncodeNet design and training framework, improving the accuracy of well-trained baseline DNNs while maintaining the overall model size. EncodeNet's building blocks comprise the trained encoder from our generalized Converting Autoencoders transferring knowledge to a lightweight classifier network, also extracted from the baseline DNN. Our experimental results demonstrate that EncodeNet improves the accuracy of VGG16 from 92.64% to 94.05% on CIFAR-10 and ResNet20 from 74.56% to 76.04% on CIFAR-100. It outperforms state-of-the-art techniques that rely on knowledge distillation and attention mechanisms, delivering higher accuracy for models of comparable size.
https://arxiv.org/abs/2404.13770
Online task-free continual learning (OTFCL) is a more challenging variant of continual learning which emphasizes the gradual shift of task boundaries and learns in an online mode. Existing methods rely on a memory buffer composed of old samples to prevent forgetting. However, the use of memory buffers not only raises privacy concerns but also hinders the efficient learning of new samples. To address this problem, we propose a novel framework called I2CANSAY that gets rid of the dependence on memory buffers and efficiently learns the knowledge of new data from one-shot samples. Concretely, our framework comprises two main modules. Firstly, the Inter-Class Analogical Augmentation (ICAN) module generates diverse pseudo-features for old classes based on the inter-class analogy of feature distributions for different new classes, serving as a substitute for the memory buffer. Secondly, the Intra-Class Significance Analysis (ISAY) module analyzes the significance of attributes for each class via its distribution standard deviation, and generates the importance vector as a correction bias for the linear classifier, thereby enhancing the capability of learning from new samples. We run our experiments on four popular image classification datasets: CoRe50, CIFAR-10, CIFAR-100, and CUB-200; our approach outperforms the prior state-of-the-art by a large margin.
https://arxiv.org/abs/2404.13576
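A heavily hedged sketch of the ISAY idea from the abstract above: for each class, the per-dimension standard deviation of its feature distribution is turned into an importance vector. How that vector is mapped into the linear classifier's correction bias is not detailed in the abstract, so the inverse-std mapping and normalization below are assumptions.

```python
import torch

def isay_importance(features_per_class, eps=1e-6):
    """Compute a per-class importance vector from intra-class feature spread.

    features_per_class: dict {class_id: tensor of shape (n_samples, feat_dim)}
    Returns: dict {class_id: tensor of shape (feat_dim,)} summing to 1 per class.
    """
    importance = {}
    for c, feats in features_per_class.items():
        std = feats.std(dim=0)            # per-dimension spread within class c
        imp = 1.0 / (std + eps)           # assumption: stable dimensions are more significant
        importance[c] = imp / imp.sum()   # normalized importance vector, later used as a
    return importance                     # correction bias for the linear classifier
```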
Transformer has been applied in the field of computer vision due to its excellent performance in natural language processing, surpassing traditional convolutional neural networks and achieving new state-of-the-art results. ViT divides an image into several local patches, known as "visual sentences". However, the information contained in the image is vast and complex, and focusing only on the features at the "visual sentence" level is not enough. The features between local patches should also be taken into consideration. In order to achieve further improvement, the TNT model was proposed, whose algorithm further divides the image into smaller patches, namely "visual words", achieving more accurate results. The core of Transformer is the Multi-Head Attention mechanism, and traditional attention mechanisms ignore interactions across different attention heads. In order to reduce redundancy and improve utilization, we introduce the nested algorithm and apply the Nested-TNT to image classification tasks. Experiments confirm that the proposed model achieves better classification performance than ViT and TNT, exceeding them by 2.25% and 1.1% on the CIFAR10 dataset and by 2.78% and 0.25% on the FLOWERS102 dataset, respectively.
https://arxiv.org/abs/2404.13434
In recent years, Vision Transformers (ViTs) have shown promising classification performance over Convolutional Neural Networks (CNNs) due to their self-attention mechanism. Many researchers have incorporated ViTs for Hyperspectral Image (HSI) classification. HSIs are characterised by narrow contiguous spectral bands, providing rich spectral data. Although ViTs excel with sequential data, they cannot extract spectral-spatial information like CNNs. Furthermore, to have high classification performance, there should be a strong interaction between the HSI token and the class (CLS) token. To solve these issues, we propose a 3D-Convolution guided Spectral-Spatial Transformer (3D-ConvSST) for HSI classification that utilizes a 3D-Convolution Guided Residual Module (CGRM) in-between encoders to "fuse" the local spatial and spectral information and to enhance the feature propagation. Furthermore, we forego the class token and instead apply Global Average Pooling, which effectively encodes more discriminative and pertinent high-level features for classification. Extensive experiments have been conducted on three public HSI datasets to show the superiority of the proposed model over state-of-the-art traditional, convolutional, and Transformer models. The code is available at this https URL.
https://arxiv.org/abs/2404.13252
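A short sketch of the classification head change described above: the CLS token is dropped and the class prediction comes from global average pooling over all tokens. The LayerNorm placement and linear head are generic ViT conventions assumed here, not details taken from the paper.

```python
import torch.nn as nn

class GAPHead(nn.Module):
    """Classification head using global average pooling instead of a CLS token."""
    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):             # tokens: (batch, num_tokens, embed_dim), no CLS token
        pooled = tokens.mean(dim=1)        # average over the whole token sequence
        return self.fc(self.norm(pooled))
```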
Underwater images taken from autonomous underwater vehicles (AUVs) often suffer from low light, high turbidity, poor contrast, motion blur, and excessive light scattering, and hence require image enhancement techniques for object recognition. Machine learning methods are being increasingly used for object recognition under such adverse conditions. These enhanced object recognition methods for images taken from AUVs have potential applications in underwater pipeline and optical fibre surveillance, ocean bed resource extraction, ocean floor mapping, underwater species exploration, etc. While classical machine learning methods are very efficient in terms of accuracy, they require large datasets and high computational time for image classification. In the current work, we use quantum-classical hybrid machine learning methods for real-time underwater object recognition on board an AUV for the first time. We use real-time motion-blurred and low-light images taken from the on-board camera of an AUV built in-house and apply existing hybrid machine learning methods for object recognition. Our hybrid methods consist of quantum encoding and flattening of classical images using quantum circuits and sending them to classical neural networks for image classification. The results of hybrid methods carried out using Pennylane-based quantum simulators, both on a GPU and using pre-trained models on an on-board NVIDIA GPU chipset, are compared with results from corresponding classical machine learning methods. We observe that the hybrid quantum machine learning methods show an efficiency greater than 65\%, reduce run-time by one-third, and require 50\% smaller dataset sizes for training the models compared to classical machine learning methods. We hope that our work opens up further possibilities in quantum-enhanced real-time computer vision in autonomous vehicles.
https://arxiv.org/abs/2404.13130
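The abstract describes quantum encoding of image data with quantum circuits followed by a classical neural network, run on PennyLane simulators. The snippet below is the generic PennyLane-PyTorch hybrid pattern (angle embedding, a shallow entangling layer, Pauli-Z readout wrapped as a Torch layer); the circuit depth, embedding choice, and layer sizes are illustrative assumptions, not the authors' exact design.

```python
import pennylane as qml
import torch.nn as nn

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def qnode(inputs, weights):
    # Angle-encode a few pixel-derived values into qubit rotations,
    # entangle, and read out Pauli-Z expectation values as "quantum features".
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weight_shapes = {"weights": (2, n_qubits)}              # 2 entangling layers (assumed depth)
quantum_layer = qml.qnn.TorchLayer(qnode, weight_shapes)

# Classical layers around the quantum circuit form the hybrid classifier.
model = nn.Sequential(
    nn.Linear(16, n_qubits),   # assumed: compress a flattened 4x4 image block into 4 angles
    quantum_layer,
    nn.Linear(n_qubits, 2),    # assumed binary object-recognition head
)
```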
Neural networks are trained by minimizing a loss function that defines the discrepancy between the predicted model output and the target value. The selection of the loss function is crucial to achieve task-specific behaviour and highly influences the capability of the model. A variety of loss functions have been proposed for a wide range of tasks affecting training and model performance. For classification tasks, the cross entropy is the de-facto standard and usually the first choice. Here, we try to experimentally challenge the well-known loss functions, including cross entropy (CE) loss, by utilizing the genetic programming (GP) approach, a population-based evolutionary algorithm. GP constructs loss functions from a set of operators and leaf nodes, and these functions are repeatedly recombined and mutated to find an optimal structure. Experiments were carried out on the small-sized datasets CIFAR-10, CIFAR-100, and Fashion-MNIST using an Inception model. The 5 best functions found were evaluated for different model architectures on a set of standard datasets ranging from 2 to 102 classes and very different sizes. One function, denoted as Next Generation Loss (NGL), clearly stood out, showing the same or better performance for all tested datasets compared to CE. To evaluate the NGL function on a large-scale dataset, we tested its performance on the ImageNet-1k dataset, where it showed improved top-1 accuracy compared to models trained with identical settings and other losses. Finally, the NGL was trained on a downstream segmentation task on the Pascal VOC 2012 and COCO-Stuff164k datasets, improving the underlying model performance.
https://arxiv.org/abs/2404.12948
As point cloud provides a natural and flexible representation usable in myriad applications (e.g., robotics and self-driving cars), the ability to synthesize point clouds for analysis becomes crucial. Recently, Xie et al. propose a generative model for unordered point sets in the form of an energy-based model (EBM). Despite the model achieving an impressive performance for point cloud generation, one separate model needs to be trained for each category to capture the complex point set distributions. Besides, their method is unable to classify point clouds directly and requires additional fine-tuning for classification. One interesting question is: Can we train a single network for a hybrid generative and discriminative model of point clouds? A similar question has recently been answered in the affirmative for images, introducing the framework of Joint Energy-based Model (JEM), which achieves high performance in image classification and generation simultaneously. This paper proposes GDPNet, the first hybrid Generative and Discriminative PointNet that extends JEM for point cloud classification and generation. Our GDPNet retains strong discriminative power of modern PointNet classifiers, while generating point cloud samples rivaling state-of-the-art generative approaches.
https://arxiv.org/abs/2404.12925
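For reference, the JEM construction that GDPNet extends reuses the classifier logits f(x) both for p(y|x) = softmax(f(x)) and for an energy E(x) = -logsumexp_y f(x)[y]. A minimal sketch of the two terms is below; training the generative term additionally requires contrasting the energy of real samples against sampler-drawn ones (e.g. via SGLD), which is omitted here.

```python
import torch
import torch.nn.functional as F

def jem_terms(logits, labels):
    """Joint energy-based view of a point cloud classifier's logits.

    logits: (batch, num_classes) output of a PointNet-style network.
    Returns the discriminative cross-entropy and the per-sample energy E(x).
    """
    discriminative = F.cross_entropy(logits, labels)   # classification objective
    energy = -torch.logsumexp(logits, dim=1)           # E(x) = -log sum_y exp(f(x)[y])
    return discriminative, energy
```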
This study proposes a multi-modal fusion framework, Multitrans, based on the Transformer architecture and self-attention mechanism. This architecture combines the study of non-contrast computed tomography (NCCT) images and discharge diagnosis reports of patients undergoing stroke treatment, using a variety of Transformer-based methods to predict functional outcomes of stroke treatment. The results show that the performance of single-modal text classification is significantly better than that of single-modal image classification, but the effect of the multi-modal combination is better than any single modality. Although the Transformer model performs worse on imaging data alone, when combined with clinical meta-diagnostic information, both modalities can learn better complementary information and contribute to accurately predicting stroke treatment effects.
https://arxiv.org/abs/2404.12634
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) in computer vision has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question whether extremely simple, small-scale ViTs can also benefit from this pre-training paradigm in their fine-tuning performance, a setting that remains considerably less studied in contrast to the well-established methodology of lightweight architecture design with sophisticated components. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe different behaviors between MIM and CL with respect to the downstream fine-tuning data scales. Furthermore, we analyze the frozen features under linear probing evaluation and also the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training on higher layers, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding is naturally a guide to choosing appropriate distillation strategies during pre-training to solve the above deterioration problem. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and LaSOT visual tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all the current SOTA lightweight CPU-realtime trackers.
https://arxiv.org/abs/2404.12210
Explainable Artificial Intelligence (XAI) poses a significant challenge in providing transparent and understandable insights into complex AI models. Traditional post-hoc algorithms, while useful, often struggle to deliver interpretable explanations. Concept-based models offer a promising avenue by incorporating explicit representations of concepts to enhance interpretability. However, existing research on automatic concept discovery methods is often limited by lower-level concepts, costly human annotation requirements, and a restricted domain of background knowledge. In this study, we explore the potential of a Large Language Model (LLM), specifically GPT-4, by leveraging its domain knowledge and common-sense capability to generate high-level concepts that are meaningful as explanations for humans, for a specific setting of image classification. We use minimal textual object information available in the data via prompting to facilitate this process. To evaluate the output, we compare the concepts generated by the LLM with two other methods: concepts generated by humans and the ECII heuristic concept induction system. Since there is no established metric to determine the human understandability of concepts, we conducted a human study to assess the effectiveness of the LLM-generated concepts. Our findings indicate that while human-generated explanations remain superior, concepts derived from GPT-4 are more comprehensible to humans compared to those generated by ECII.
https://arxiv.org/abs/2404.11875
Achieving rotation invariance in deep neural networks without relying on data has always been a hot research topic. Intrinsic rotation invariance can enhance the model's feature representation capability, enabling better performance in tasks such as multi-orientation object recognition and detection. Based on various types of non-learnable operators, including gradient, sort, local binary pattern, maximum, etc., this paper designs a set of new convolution operations that are naturally invariant to arbitrary rotations. Unlike most previous studies, these rotation-invariant convolutions (RIConvs) have the same number of learnable parameters and a similar computational process as conventional convolution operations, allowing them to be interchangeable. Using the MNIST-Rot dataset, we first verify the invariance of these RIConvs under various rotation angles and compare their performance with previous rotation-invariant convolutional neural networks (RI-CNNs). Two types of RIConvs based on gradient operators achieve state-of-the-art results. Subsequently, we combine RIConvs with different types and depths of classic CNN backbones. Using the OuTex_00012, MTARSI, and NWPU-RESISC-45 datasets, we test their performance on texture recognition, aircraft type recognition, and remote sensing image classification tasks. The results show that RIConvs significantly improve the accuracy of these CNN backbones, especially when the training data is limited. Furthermore, we find that even with data augmentation, RIConvs can further enhance model performance.
https://arxiv.org/abs/2404.11309
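As a concrete illustration of the idea, the sketch below implements one simple rotation-invariant convolution in the spirit of the maximum-based operator: the response is the element-wise maximum over the kernel rotated by 0/90/180/270 degrees, so it keeps the same number of learnable parameters as a standard convolution. The paper's operators (gradient, sort, local binary pattern, etc.) and its invariance to arbitrary angles differ in detail; this is only a hedged stand-in restricted to right-angle rotations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Max90RIConv2d(nn.Module):
    """Convolution whose response is invariant to 90-degree rotations of the kernel."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.05)
        self.padding = padding

    def forward(self, x):
        # Convolve with the kernel at four orientations and keep the strongest response.
        outs = [F.conv2d(x, torch.rot90(self.weight, r, dims=(2, 3)), padding=self.padding)
                for r in range(4)]
        return torch.stack(outs, dim=0).max(dim=0).values
```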
Pre-trained vision-language (V-L) models such as CLIP have shown excellent performance in many downstream cross-modal tasks. However, most of them are only applicable to the English context. Subsequent research has focused on this problem and proposed improved models, such as CN-CLIP and AltCLIP, to facilitate their applicability to Chinese and even other languages. Nevertheless, these models suffer from high latency and a large memory footprint in inference, which limits their further deployment on resource-constrained edge devices. In this work, we propose a conceptually simple yet effective multilingual CLIP Compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English context. In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual vision-language feature distillation and alignment. During the first stage, lightweight image/text student models are designed to learn robust visual/multilingual textual feature representation ability from corresponding teacher models, respectively. Subsequently, the multilingual vision-language alignment stage enables effective alignment of visual and multilingual textual features to further improve the model's multilingual performance. Comprehensive experiments in zero-shot image classification, conducted based on the ELEVATER benchmark, showcase that DC-CLIP achieves superior performance in the English context and competitive performance in the Chinese context, even with less training data, when compared to existing models of similar parameter magnitude. The evaluation demonstrates the effectiveness of our designed training mechanism.
https://arxiv.org/abs/2404.11249
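The first training stage described above has lightweight student encoders regress the frozen teacher's visual/multilingual textual features. The abstract does not specify the distillation objective; mean-squared error on L2-normalized features is a common choice and is used here purely as an assumption.

```python
import torch.nn.functional as F

def feature_distillation_loss(student_feats, teacher_feats):
    """Stage-1 feature distillation sketch: student features regress teacher features.

    student_feats, teacher_feats: (batch, dim) image or text embeddings.
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1).detach()   # teacher is frozen
    return F.mse_loss(s, t)
```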
Invasive ductal carcinoma (IDC) is the most prevalent form of breast cancer. Breast tissue histopathological examination is critical in diagnosing and classifying breast cancer. Although existing methods have shown promising results, there is still room for improvement in the classification accuracy and generalization of IDC using histopathology images. We present a novel approach, Supervised Contrastive Vision Transformer (SupCon-ViT), for improving the classification of invasive ductal carcinoma in terms of accuracy and generalization by leveraging the inherent strengths and advantages of both transfer learning, i.e., pre-trained vision transformer, and supervised contrastive learning. Our results on a benchmark breast cancer dataset demonstrate that SupCon-ViT achieves state-of-the-art performance in IDC classification, with an F1-score of 0.8188, precision of 0.7692, and specificity of 0.8971, outperforming existing methods. In addition, the proposed model demonstrates resilience in scenarios with minimal labeled data, making it highly efficient in real-world clinical settings where labeled data is limited. Our findings suggest that supervised contrastive learning in conjunction with pre-trained vision transformers appears to be a viable strategy for an accurate classification of IDC, thus paving the way for a more efficient and reliable diagnosis of breast cancer through histopathological image analysis.
https://arxiv.org/abs/2404.11052
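The abstract combines a pre-trained ViT with supervised contrastive learning; the snippet below is the standard supervised contrastive loss (Khosla et al., 2020) applied to ViT embeddings, with the temperature value and batch-level formulation as commonly used rather than confirmed from the paper.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.07):
    """Supervised contrastive loss: embeddings sharing a label are pulled together.

    features: (batch, dim) embeddings (e.g. ViT CLS outputs); labels: (batch,) ints.
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature                                   # pairwise similarities
    not_self = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & not_self    # same-class pairs, minus self
    # log-probability of each candidate j given anchor i, over all non-self candidates
    log_prob = sim - torch.logsumexp(sim.masked_fill(~not_self, float("-inf")), dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    per_anchor = -(log_prob * pos_mask).sum(dim=1) / pos_counts.clamp(min=1)
    return per_anchor[pos_counts > 0].mean()                      # ignore anchors with no positives
```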
Semi-supervised image classification, leveraging pseudo supervision and consistency regularization, has demonstrated remarkable success. However, the ongoing challenge lies in fully exploiting the potential of unlabeled data. To address this, we employ information entropy neural estimation to harness the potential of unlabeled samples. Inspired by contrastive learning, the entropy is estimated by maximizing a lower bound on mutual information across different augmented views. Moreover, we theoretically analyze that the information entropy of the posterior of an image classifier is approximated by maximizing the likelihood function of the softmax predictions. Guided by these insights, we optimize our model from both perspectives to ensure that the predicted probability distribution closely aligns with the ground-truth distribution. Given the theoretical connection to information entropy, we name our method \textit{InfoMatch}. Through extensive experiments, we show its superior performance.
https://arxiv.org/abs/2404.11003
Large vision-language models revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts the set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other more complex vision-language models, on classification and semantic segmentation benchmarks, while using much fewer parameters.
https://arxiv.org/abs/2404.10864
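A hedged sketch of the CaSED procedure summarized above: retrieve the captions most similar to the image from an external database, pool candidate category names from them, then score the candidates with the same vision-language model. The OpenAI `clip` package, the prompt template, and the naive word-splitting that stands in for the paper's candidate filtering are all assumptions for illustration.

```python
import torch
import clip  # OpenAI CLIP package, used here as a generic pre-trained VLM

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def vocabulary_free_classify(image, caption_db, caption_feats, top_k=10):
    """image: a PIL image; caption_db: list of caption strings from an external database;
    caption_feats: pre-computed, L2-normalized CLIP text features for those captions."""
    img = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    img = img / img.norm(dim=-1, keepdim=True)

    # 1) Retrieve the most semantically similar captions.
    sims = (img @ caption_feats.T).squeeze(0)
    nearest = [caption_db[i] for i in sims.topk(top_k).indices.tolist()]

    # 2) Pool candidate category names (naive extraction as a stand-in for candidate filtering).
    candidates = sorted({w.strip(".,").lower() for c in nearest for w in c.split() if len(w) > 3})

    # 3) Score candidates with the same VLM and return the best match.
    tokens = clip.tokenize([f"a photo of a {c}" for c in candidates]).to(device)
    txt = model.encode_text(tokens)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return candidates[(img @ txt.T).argmax().item()]
```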
Images captured from the real world are often affected by different types of noise, which can significantly impact the performance of Computer Vision systems and the quality of visual data. This study presents a novel approach for defect detection in noisy images of casting products, specifically focusing on submersible pump impellers. The methodology involves utilizing deep learning models such as VGG16 and InceptionV3, among others, in both the spatial and frequency domains to identify noise types and defect status. The research process begins with preprocessing images, followed by applying denoising techniques tailored to specific noise categories. The goal is to enhance the accuracy and robustness of defect detection by integrating noise detection and denoising into the classification pipeline. The study achieved remarkable results using VGG16 for noise type classification in the frequency domain, reaching an accuracy of over 99%. Removal of salt-and-pepper noise resulted in an average SSIM of 87.9, while Gaussian noise removal had an average SSIM of 64.0, and periodic noise removal yielded an average SSIM of 81.6. This comprehensive approach showcases the effectiveness of the deep autoencoder model and the median filter as denoising strategies in real-world industrial applications. Finally, our study reports significant improvements in binary classification accuracy for defect detection compared to previous methods. For the VGG16 classifier, accuracy increased from 94.6% to 97.0%, demonstrating the effectiveness of the proposed noise detection and denoising approach. Similarly, for the InceptionV3 classifier, accuracy improved from 84.7% to 90.0%, further validating the benefits of integrating noise analysis into the classification pipeline.
https://arxiv.org/abs/2404.10664
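The pipeline above first classifies the noise type and then applies a matching denoiser before defect classification. A small routing sketch follows; the median filter for salt-and-pepper noise follows the study, while the non-local-means call for Gaussian noise is a classical stand-in for the deep autoencoder the study uses, and periodic-noise notch filtering in the Fourier domain is only indicated in a comment.

```python
import cv2
import numpy as np

def denoise_by_type(image: np.ndarray, noise_type: str) -> np.ndarray:
    """Route a (grayscale, uint8) image through a denoiser chosen by the predicted
    noise type; the noise-type classifier (e.g. VGG16 on the spectrum) is assumed."""
    if noise_type == "salt_and_pepper":
        return cv2.medianBlur(image, 3)                        # median filter, as in the study
    if noise_type == "gaussian":
        return cv2.fastNlMeansDenoising(image, None, h=10)     # stand-in for the deep autoencoder
    if noise_type == "periodic":
        # Periodic noise is typically suppressed by notch-filtering the spikes of the
        # Fourier spectrum; that frequency-domain step is not shown here.
        return image
    return image
```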
Hardware-aware Neural Architecture Search approaches (HW-NAS) automate the design of deep learning architectures, tailored specifically to a given target hardware platform. Yet, these techniques demand substantial computational resources, primarily due to the expensive process of assessing the performance of identified architectures. To alleviate this problem, a recent direction in the literature has employed representation similarity metric for efficiently evaluating architecture performance. Nonetheless, since it is inherently a single objective method, it requires multiple runs to identify the optimal architecture set satisfying the diverse hardware cost constraints, thereby increasing the search cost. Furthermore, simply converting the single objective into a multi-objective approach results in an under-explored architectural search space. In this study, we propose a Multi-Objective method to address the HW-NAS problem, called MO-HDNAS, to identify the trade-off set of architectures in a single run with low computational cost. This is achieved by optimizing three objectives: maximizing the representation similarity metric, minimizing hardware cost, and maximizing the hardware cost diversity. The third objective, i.e. hardware cost diversity, is used to facilitate a better exploration of the architecture search space. Experimental results demonstrate the effectiveness of our proposed method in efficiently addressing the HW-NAS problem across six edge devices for the image classification task.
https://arxiv.org/abs/2404.12403
In computer vision, explainable AI (xAI) methods seek to mitigate the 'black-box' problem by making the decision-making process of deep learning models more interpretable and transparent. Traditional xAI methods concentrate on visualizing input features that influence model predictions, providing insights primarily suited for experts. In this work, we present an interaction-based xAI method that enhances user comprehension of image classification models through their interaction. Thus, we developed a web-based prototype allowing users to modify images via painting and erasing, thereby observing changes in classification results. Our approach enables users to discern critical features influencing the model's decision-making process, aligning their mental models with the model's logic. Experiments conducted with five images demonstrate the potential of the method to reveal feature importance through user interaction. Our work contributes a novel perspective to xAI by centering on end-user engagement and understanding, paving the way for more intuitive and accessible explainability in AI systems.
https://arxiv.org/abs/2404.09828
In pseudo-labeling (PL), which is a type of semi-supervised learning, pseudo-labels are assigned based on the confidence scores provided by the classifier; therefore, accurate confidence is important for successful PL. In this study, we propose a PL algorithm based on an energy-based model (EBM), which is referred to as the energy-based PL (EBPL). In EBPL, a neural network-based classifier and an EBM are jointly trained by sharing their feature extraction parts. This approach enables the model to learn both the class decision boundary and input data distribution, enhancing confidence calibration during network training. The experimental results demonstrate that EBPL outperforms the existing PL method in semi-supervised image classification tasks, with superior confidence calibration error and recognition accuracy.
https://arxiv.org/abs/2404.09585