Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but their extracted features lack traditionally desired properties, such as rotational equivariance, that could further improve model performance. Such properties are ubiquitous in biomedical images, which often lack explicit orientation. Current work largely relies on data augmentation or explicit modules to capture orientation information, at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while simultaneously compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture equivariance to rotation using the public MedMNISTv2 dataset (16 total tasks). SRE-Conv-CNN demonstrated improved classification accuracy on rotated images on all 16 test datasets, covering both 2D and 3D images, while increasing efficiency with fewer parameters and a reduced memory footprint. The code is available at this https URL.
卷积神经网络(CNN)是计算机视觉任务中的关键工具,但它们缺乏一些传统上期望的特性,这些特性能够进一步提升模型性能,比如旋转等变性。这种性质在生物医学图像中很常见,而这类图像通常没有明确的方向信息。虽然当前的研究主要依赖于数据增强或显式模块来捕捉方向信息,但这会增加训练成本或导致对所需等变性的无效近似。为了克服这些挑战,我们提出了一种新颖且高效的对称旋转等变(SRE)卷积核(SRE-Conv)的实现方法,旨在学习旋转不变特征的同时压缩模型大小。SRE-Conv 核可以轻松集成到任何 CNN 主干网络中。我们使用公开的 MedMNISTv2 数据集(共16个任务)验证了深层 SRE-CNN 捕获旋转等变性的能力。在二维和三维图像的所有16个测试数据集中,SRE-Conv-CNN 在所有情况下都显示出了更高的旋转图像分类精度,并且通过减少参数数量和降低内存占用提高了效率。代码可在此 https URL 获取。
https://arxiv.org/abs/2501.09753
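The abstract above does not say how the SRE-Conv kernel is parameterized. As a rough illustration only, one simple way to get a symmetric kernel with fewer parameters is to tie all weights that lie at the same (rounded) distance from the kernel center. The function names and the ring-based weight sharing below are assumptions for this sketch, not the authors' implementation, and on a discrete grid the symmetry shown is exact only for rotations by multiples of 90 degrees.

```python
import math

def radial_kernel(size, ring_weights):
    """Build a size x size kernel whose entries depend only on the
    rounded distance from the center (hypothetical weight tying)."""
    c = (size - 1) / 2
    kernel = []
    for i in range(size):
        row = []
        for j in range(size):
            ring = round(math.hypot(i - c, j - c))  # ring index of this cell
            row.append(ring_weights[ring])
        kernel.append(row)
    return kernel

def rot90(m):
    """Rotate a square matrix by 90 degrees clockwise."""
    return [list(r) for r in zip(*m[::-1])]

# A 5x5 kernel built from 4 shared ring weights instead of 25 free weights.
k = radial_kernel(5, [1.0, 0.5, 0.25, 0.1])
```

Because every entry depends only on its distance from the center, rotating `k` by 90 degrees returns the same kernel, which is the property that makes the convolution response insensitive to such rotations of the input.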
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Codes have been released on this https URL.
出于隐私和安全方面的考虑,从预训练的视觉模型中删除不需要的信息的需求变得越来越明显。在现实场景中,用户和模型所有者随时都可能提出擦除请求,并且这些请求通常形成一个序列。因此,在这种设置下,期望能够持续地从预训练模型中移除特定信息的同时保持其余部分不受影响。我们将这个问题定义为连续遗忘问题,并识别出三个关键挑战。(i)对于不需要的知识,高效且有效的删除至关重要。(ii)对于保留下来的知识,遗忘过程带来的影响应该最小化。(iii)在现实场景中,在遗忘过程中可用的训练样本可能非常稀少或部分缺失。 为了应对这些挑战,我们首先提出了组稀疏LoRA(GS-LoRA)。具体来说,针对(i),我们引入LoRA模块,为每个遗忘任务独立微调Transformer块中的FFN层;针对(ii),采用了简单的组稀疏正则化方法,从而能够自动选择特定的LoRA组并将其他组置零。为了将GS-LoRA进一步扩展到更多实际场景中使用,我们将原型信息作为额外监督引入,并提出了一种更实用的方法 GS-LoRA++。对于每个被遗忘的类别,我们使其logits远离其原始原型;而对于剩余的类别,则将logits拉近各自的原型。我们在人脸识别、目标检测和图像分类上进行了广泛的实验,证明我们的方法能够在对其他类别影响最小的情况下遗忘特定类别。 代码已在此 https URL 发布。
https://arxiv.org/abs/2501.09705
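The group sparse regularization mentioned in the GS-LoRA abstract can be pictured with a generic group-lasso penalty: the sum of per-group L2 norms, which pushes whole groups toward zero. The sketch below is a minimal illustration under that assumption. The grouping (one group per LoRA block), the `alpha` coefficient, and the function names are hypothetical and are not taken from the paper.

```python
import math

def group_norm(group):
    """L2 norm over all parameters in one group (a list of flat weight lists)."""
    return math.sqrt(sum(w * w for mat in group for w in mat))

def group_sparse_penalty(groups, alpha=0.01):
    """Group-lasso penalty: alpha times the sum of per-group L2 norms."""
    return alpha * sum(group_norm(g) for g in groups)

# Hypothetical LoRA groups: each holds the flattened A and B matrices of one block.
groups = [
    [[0.0, 0.0], [0.0, 0.0]],  # zeroed-out group: contributes nothing
    [[3.0, 0.0], [0.0, 4.0]],  # active group with L2 norm 5
]
penalty = group_sparse_penalty(groups, alpha=0.1)
```

Adding such a term to the forgetting loss penalizes each group as a unit, so a group is either kept or driven to zero as a whole. This matches the "automatic selection of specific LoRA groups" described above only in spirit.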
Training deep neural networks requires datasets with a large number of annotated examples. The collection and annotation of these datasets is not only extremely expensive but also faces legal and privacy problems. These factors are a significant limitation for many real-world applications. To address this, we introduce HydraMix, a novel architecture that generates new image compositions by mixing multiple different images from the same class. HydraMix learns the fusion of the content of various images guided by a segmentation-based mixing mask in feature space and is optimized via a combination of unsupervised and adversarial training. Our data augmentation scheme allows the creation of models trained from scratch on very small datasets. We conduct extensive experiments on ciFAIR-10, STL-10, and ciFAIR-100. Additionally, we introduce a novel text-image metric to assess the generality of the augmented datasets. Our results show that HydraMix outperforms existing state-of-the-art methods for image classification on small datasets.
训练深度神经网络需要大量带有标注的数据集。这些数据集的收集和标注不仅成本极高,还面临着法律和隐私问题。这些问题在许多实际应用中构成了重大限制。为了应对这一挑战,我们引入了一种名为HydraMix的新架构,该架构通过混合同一类中的多个不同图像来生成新的图像组合。HydraMix利用基于分割的混合掩码,在特征空间中学习多种图像内容的融合,并通过无监督和对抗性训练进行优化。我们的数据增强方案允许从非常小的数据集中从头开始创建模型。 我们在ciFAIR-10、STL-10和ciFAIR-100上进行了广泛的实验,并且还引入了一种新的文本-图像指标,用于评估扩充后的数据集的泛化能力。实验结果表明,HydraMix在小数据集上的图像分类任务中优于现有的最先进的方法。
https://arxiv.org/abs/2501.09504
Nowadays, more and more images are available. Annotation and retrieval of these images pose classification problems, where each class is defined as the group of database images labelled with a common semantic label. Various systems have been proposed for content-based retrieval, as well as for image classification and indexing. In this paper, a hierarchical classification framework is proposed for bridging the semantic gap effectively and achieving multi-category image classification. A well-known pre-processing and post-processing method was applied to three problems: image segmentation, object identification and image classification. The method was used to classify single-object images from Amazon and Google datasets. The classification was tested with four different classifiers: BayesNetwork (BN), Random Forest (RF), Bagging and Vote. The estimated classification accuracies ranged from 20% to 99% (using 10-fold cross validation). The Bagging classifier showed the best performance, followed by the Random Forest classifier.
如今,越来越多的图像可供使用。对这些图像进行标注和检索时会遇到分类问题,每个类别被定义为一组带有共同语义标签的数据集图片。已经提出了多种基于内容的检索系统以及用于图像分类和索引的方法。本文提出了一种层次化分类框架,旨在有效地弥合语义差距,并实现多类别的图像分类。文中使用并应用了一个著名的预处理和后处理方法来解决三个问题:图像分割、对象识别和图像分类。该方法被应用于亚马逊(Amazon)和谷歌(Google)数据集中的单个对象图片的分类任务上。 采用四种不同的分类器进行测试,包括贝叶斯网络(Bayes Network, BN)、随机森林(Random Forest, RF)、Bagging 和投票(Vote)。经过10折交叉验证后,估计的分类准确率范围从20%到99%。其中,Bagging 分类器表现最佳,其次是随机森林分类器。
https://arxiv.org/abs/2501.09311
Few-shot learning in medical image classification presents a significant challenge due to the limited availability of annotated data and the complex nature of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that leverages the capabilities of Large Vision-Language Models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy, combining domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial improvements in performance compared to existing baselines. Our work highlights the potential of hierarchical contrastive strategies in adapting LVLMs to the unique challenges of medical imaging tasks.
在医疗图像分类中,少样本学习(few-shot learning)面临着一个显著的挑战,即标注数据的有限可用性和医学影像的复杂性。本文提出了一种新的框架——自适应视觉-语言微调与层次对比对齐(HiCA),该框架利用大规模视觉-语言模型(LVLMs)的能力来进行医疗图像分析。HiCA引入了一个两阶段的微调策略,结合领域特定预训练和分层对比学习来在多个层级上对齐视觉和文本表示。我们在两个基准数据集——胸部X光片和乳腺超声检查数据集中评估了我们的方法,在少样本(few-shot)和零样本(zero-shot)设置下均取得了最先进的性能表现。进一步的分析表明,与现有的基线相比,该方法具有更强的鲁棒性、泛化能力和可解释性,并在性能上实现了显著提升。我们的工作强调了层次对比策略在将LVLMs适应医学成像任务独特挑战方面的潜力。
https://arxiv.org/abs/2501.09294
CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it can be comparable to or even exceeds state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at this https URL.
CLIP(对比语言图像预训练)在模式识别和计算机视觉领域取得了巨大成功。将CLIP转移到下游任务(如零样本或少样本分类)是多模态学习中的热门话题。然而,当前的研究主要集中在文本提示学习或视觉适配器微调上,未能充分挖掘图像-文本对之间的互补信息和关联性。在本文中,我们提出了一种图像描述增强的CLIP适配器(IDEA)方法,用于将CLIP适应于少样本图像分类任务。该方法通过利用图像的视觉特征和文本描述来捕捉细粒度特征。IDEA是一种针对CLIP的无需训练的方法,在多个任务上可以与最先进的模型媲美甚至超过它们。 此外,我们引入了Trainable-IDEA(T-IDEA),它在IDEA的基础上增加了两个轻量级可学习组件(即投影器和可学习潜在空间),进一步提升了模型性能,并在11个数据集上实现了最先进的结果。作为一项重要贡献,我们采用了Llama模型并设计了一个综合的管道来为11个数据集上的图像生成文本描述,总共产生了1,637,795对图像-文本配对,命名为"IMD-11"。 我们的代码和数据已在此 https URL 发布。
https://arxiv.org/abs/2501.08816
Feature extraction techniques are crucial in medical image classification; however, classical feature extractors combined with traditional machine learning classifiers often fail to provide sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model's adaptability to the challenges presented by medical imaging data. The quality of the MIAFEx output features is compared against classical feature extractors using traditional and hybrid classifiers. The performance of these features is also compared against modern CNN and ViT models in classification tasks, demonstrating its superiority in accuracy and robustness across multiple complex medical imaging classification datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at this https URL.
特征提取技术在医学图像分类中至关重要;然而,传统的机器学习分类器与经典的特征提取方法经常表现出提供足够区分信息的能力不足,特别是在处理复杂的图像集合时。虽然卷积神经网络(CNN)和视觉变换器(ViT)在特征提取方面展现出巨大潜力,但由于医疗影像数据固有的特性,如样本量小或类内变异大,它们容易出现过拟合问题。 本文提出了一种新的方法,即基于医学图像注意力的特征提取器(MIAFEx),该方法利用可学习的改进机制来增强变换器编码器架构中的分类标记。这种机制根据学到的权重调整标记,从而提高显著性特征的抽取,并增强了模型对医疗影像数据挑战的适应能力。本文将使用传统和混合分类器对比MIAFEx输出特征的质量与经典特征提取方法的结果;同时还将这些特征在分类任务上的性能与现代CNN和ViT模型进行比较,证明其在多个复杂医学图像分类数据集上具有更高的准确性和鲁棒性。这种优势尤其明显于样本量有限的场景中,在这种情况下,传统和现代模型往往难以有效地泛化。 该方法的源代码可在此 https URL 获取。
https://arxiv.org/abs/2501.08562
The ability to train ever-larger neural networks brings artificial intelligence to the forefront of scientific and technical discoveries. However, their exponentially increasing size creates a proportionally greater demand for energy and computational hardware. Incorporating complex physical events in networks as fixed, efficient computation modules can address this demand by decreasing the complexity of trainable layers. Here, we utilize ultrashort pulse propagation in multimode fibers, which perform large-scale nonlinear transformations, for this purpose. Training the hybrid architecture is achieved through a neural model that differentiably approximates the optical system. The training algorithm updates the neural simulator and backpropagates the error signal over this proxy to optimize layers preceding the optical one. Our experimental results achieve state-of-the-art image classification accuracies and simulation fidelity. Moreover, the framework demonstrates exceptional resilience to experimental drifts. By integrating low-energy physical systems into neural networks, this approach enables scalable, energy-efficient AI models with significantly reduced computational demands.
训练越来越大的神经网络将人工智能推向了科学和技术发现的前沿。然而,它们的大小呈指数级增长导致对能源和计算硬件的需求也相应地大幅增加。在神经网络中引入复杂的物理事件作为固定的、高效的计算模块可以通过减少可训练层的复杂性来应对这一需求。在这里,我们利用多模光纤中的超短脉冲传播来进行大规模非线性变换以实现此目标。通过一个能够微分逼近光学系统的神经模型,实现了混合架构的训练过程。该训练算法更新神经仿真器,并将错误信号反向传播到先前的光学层以进行优化。我们的实验结果达到了最先进的图像分类准确率和仿真的保真度。此外,该框架还表现出对实验漂移的强大适应性。通过将低能耗物理系统集成到神经网络中,这种方法能够构建可扩展、能源效率高且计算需求显著减少的人工智能模型。
https://arxiv.org/abs/2501.07991
deepTerra is a comprehensive platform designed to facilitate the classification of land surface features using machine learning and satellite imagery. The platform includes modules for data collection, image augmentation, training, testing, and prediction, streamlining the entire workflow for image classification tasks. This paper presents a detailed overview of the capabilities of deepTerra, shows how it has been applied to various research areas, and discusses the future directions it might take.
deepTerra 是一个全面的平台,旨在利用机器学习和卫星图像来促进地表特征分类。该平台包含数据收集、图像增强、训练、测试以及预测等模块,从而简化了整个图像分类任务的工作流程。本文详细介绍了 deepTerra 的各项功能,并展示了它在各种研究领域的应用情况,同时讨论了其未来可能的发展方向。
https://arxiv.org/abs/2501.07859
The deployment of neural networks in vehicle platforms and wearable Artificial Intelligence-of-Things (AIOT) scenarios has become a research area that has attracted much attention. With the continuous evolution of deep learning technology, many image classification models are committed to improving recognition accuracy, but this is often accompanied by problems such as large model resource usage, complex structure, and high power consumption, which makes it challenging to deploy on resource-constrained platforms. Herein, we propose an ultra-lightweight binary neural network (BNN) model designed for hardware deployment, and conduct image classification research based on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. In addition, we also verify it on the Chinese Traffic Sign (CTS) and Belgian Traffic Sign (BTS) datasets. The proposed model shows excellent recognition performance with an accuracy of up to 97.64%, making it one of the best performing BNN models in the GTSRB dataset. Compared with the full-precision model, the accuracy loss is controlled within 1%, and the parameter storage overhead of the model is only 10% of that of the full-precision model. More importantly, our network model only relies on logical operations and low-bit width fixed-point addition and subtraction operations during the inference phase, which greatly simplifies the design complexity of the processing element (PE). Our research shows the great potential of BNN in the hardware deployment of computer vision models, especially in the field of computer vision tasks related to autonomous driving.
在车辆平台和可穿戴人工智能物联网(AIOT)场景中部署神经网络已经成为一个备受关注的研究领域。随着深度学习技术的不断演进,许多图像分类模型致力于提高识别精度,但这通常伴随着大模型资源消耗、结构复杂以及高功耗等问题,这使得它们难以部署到资源受限的平台上。在此背景下,我们提出了一种专为硬件部署设计的超轻量级二值神经网络(BNN)模型,并基于德国交通标志识别基准数据集(GTSRB)进行了图像分类研究。此外,我们也在中国交通标志(CTS)和比利时交通标志(BTS)数据集上对其进行了验证。 所提出的模型表现出卓越的识别性能,在准确率方面达到97.64%,使其成为在GTSRB数据集中表现最好的二值神经网络模型之一。与全精度模型相比,其准确性损失控制在1%以内,并且该模型参数存储开销仅为全精度模型的10%。更重要的是,在推理阶段,我们的网络模型仅依赖逻辑运算和低位宽定点加减法操作,这极大地简化了处理单元(PE)的设计复杂度。 研究表明,BNN在计算机视觉模型硬件部署方面具有巨大潜力,特别是在与自动驾驶相关的计算机视觉任务领域。
https://arxiv.org/abs/2501.07808
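The claim that BNN inference relies only on logical operations and simple additions can be illustrated with the standard XNOR-popcount trick for dot products of +1/-1 vectors. This is a generic sketch of that well-known technique, not the network from the paper. The bit-packing convention and the function names are assumptions.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int (bit 1 for +1, bit 0 for -1)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def binary_dot(a_word, w_word, n):
    """Dot product of two +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = bin(~(a_word ^ w_word) & mask).count("1")  # positions with equal sign
    return 2 * agree - n  # agreements minus disagreements

acts = [1, -1, 1, 1, -1, -1, 1, -1]
wts = [1, 1, -1, 1, -1, 1, 1, -1]
result = binary_dot(pack_bits(acts), pack_bits(wts), len(acts))
```

The multiply-accumulate of a full-precision layer collapses to one XNOR, one popcount, and one subtraction per packed word, which is why the processing element can stay small.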
Knowledge distillation has been widely adopted in computer vision task processing, since it can effectively enhance the performance of lightweight student networks by leveraging the knowledge transferred from cumbersome teacher networks. Most existing knowledge distillation methods utilize Kullback-Leibler divergence to mimic the logit output probabilities between the teacher network and the student network. Nonetheless, these methods may neglect the negative parts of the teacher's "dark knowledge" because the divergence calculations may ignore the effect of the minute probabilities from the teacher's logit output. This deficiency may lead to suboptimal performance in logit mimicry during the distillation process and result in an imbalance of information acquired by the student network. In this paper, we investigate the impact of this imbalance and propose a novel method, named Balance Divergence Distillation. By introducing a compensatory operation using reverse Kullback-Leibler divergence, our method can improve the modeling of the extremely small values in the negative part of the teacher's output and preserve the learning capacity for the positive part. Furthermore, we test the impact of different temperature coefficient adjustments, which may be used to further balance the knowledge transfer. We evaluate the proposed method on several computer vision tasks, including image classification and semantic segmentation. The evaluation results show that our method achieves an accuracy improvement of 1%~3% for lightweight students on both the CIFAR-100 and ImageNet datasets, and a 4.55% improvement in mIoU for PSP-ResNet18 on the Cityscapes dataset. The experiments show that our method is a simple yet highly effective solution that can be smoothly applied to different knowledge distillation methods.
知识蒸馏在计算机视觉任务处理中已被广泛采用,因为它可以有效利用从复杂的教师网络转移的知识来增强轻量级学生网络的性能。现有的大多数知识蒸馏方法都使用Kullback-Leibler散度(KL散度)来模仿教师网络和学生网络之间的logit输出概率。然而,这些方法可能会忽视教师“暗知识”的负面部分,因为其计算可能忽略了从教师的logit输出中产生的微小概率的影响。这种不足可能导致蒸馏过程中logit模仿次优,并导致学生网络获取的信息不平衡。在本文中,我们探讨了这一信息不平衡的影响,并提出了一种名为平衡散度蒸馏的新方法。通过引入反向Kullback-Leibler散度的补偿操作,我们的方法可以改善对教师极小值(负部分)的建模并保持正部分的学习能力。此外,我们还测试了不同温度系数调整的影响,这可能进一步平衡知识转移过程。我们在多个计算机视觉任务上评估了所提出的方法,包括图像分类和语义分割。实验结果表明,在CIFAR-100和ImageNet数据集上的轻量级学生网络准确率提高了1%~3%,在Cityscapes数据集中PSP-ResNet18的mIoU(平均交并比)提升了4.55%。实验证明,我们的方法是一种简单且高度有效的方法,可以平滑地应用于不同的知识蒸馏方法中。
https://arxiv.org/abs/2501.07804
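To make the idea of compensating forward KL with reverse KL concrete, here is a minimal sketch that adds the two divergences between temperature-softened teacher and student distributions. The abstract does not give the exact loss, so the weighting `beta`, the separate temperatures `t_fwd` and `t_rev`, and all function names are assumptions for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max logit for stability."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def balanced_divergence(teacher_logits, student_logits,
                        t_fwd=4.0, t_rev=4.0, beta=1.0):
    """Forward KL(teacher || student) plus beta * reverse KL(student || teacher)."""
    p_f, q_f = softmax(teacher_logits, t_fwd), softmax(student_logits, t_fwd)
    p_r, q_r = softmax(teacher_logits, t_rev), softmax(student_logits, t_rev)
    return kl(p_f, q_f) + beta * kl(q_r, p_r)

teacher = [6.0, 1.0, -2.0, -5.0]
student = [4.0, 2.0, 0.0, -1.0]
loss = balanced_divergence(teacher, student)
```

Forward KL weights each class by the teacher's probability, so classes the teacher considers very unlikely contribute little. The reverse term weights by the student's probability and therefore penalizes a student that puts mass where the teacher has almost none.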
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at this https URL.
图像金字塔广泛应用于高性能方法中,用于获取多尺度特征以实现精确的视觉感知和理解。然而,当前的图像金字塔使用相同的大型模型来处理不同分辨率的图像,导致计算成本显著增加。为了应对这一挑战,我们提出了一种新的网络架构,称为参数反转图像金字塔网络(PIIP)。具体来说,PIIP 使用预训练的模型(ViTs 或 CNNs)作为分支来处理多尺度图像,其中更高分辨率的图像由较小的网络分支进行处理,以平衡计算成本和性能。为了整合不同空间尺度的信息,我们还提出了一种新颖的跨分支特征交互机制。 为了验证 PIIP 的有效性,我们将它应用于各种感知模型以及一种代表性的多模态大型语言模型 LLaVA,并在对象检测、分割、图像分类和多模态理解等各项任务上进行了广泛的实验。PIIP 在计算成本更低的情况下,相较于单分支和现有多种分辨率方法实现了更优的性能表现。 当 PIIP 应用于大规模视觉基础模型 InternViT-6B 时,在检测和分割方面可以提升其1%-2% 的性能,并且仅需原始计算量的40%-60%,最终在 MS COCO 上实现 60.0 box AP,而在 ADE20K 上实现 59.7 mIoU。对于多模态理解,我们的 PIIP-LLaVA 使用仅有2.8M 的训练数据,在 TextVQA 上实现了 73.0% 准确率,并在 MMBench 上达到了 74.5%。 我们的代码已在此 https URL 发布。
https://arxiv.org/abs/2501.07783
Precision agriculture in general, and precision weeding in particular, have greatly benefited from the major advancements in deep learning and computer vision. A large variety of commercial robotic solutions are already available and deployed. However, the adoption of such solutions by farmers is still low for many reasons, an important one being the lack of trust in these systems. This is in great part due to the opaqueness and complexity of deep neural networks and the manufacturers' inability to provide valid guarantees on their performance. Conformal prediction, a well-established methodology in the machine learning community, is an efficient and reliable strategy for providing trustworthy guarantees on the predictions of any black-box model under very minimal constraints. Bridging the gap between the safe machine learning and precision agriculture communities, this article showcases conformal prediction in action on the task of precision weeding through deep learning-based image classification. After a detailed presentation of the conformal prediction methodology and the development of a precision spraying pipeline based on a "conformalized" neural network and well-defined spraying decision rules, the article evaluates this pipeline on two real-world scenarios: one under in-distribution conditions, the other reflecting a near out-of-distribution setting. The results show that we are able to provide formal, i.e. certifiable, guarantees on spraying at least 90% of the weeds.
总体而言,精准农业,特别是精准除草,在深度学习和计算机视觉领域的重大进展中受益匪浅。市场上已经出现了各种各样的商用机器人解决方案,并且这些方案已经被部署应用。然而,由于许多原因,农民采用这类解决方案的比例仍然较低,其中一个重要原因是对这些系统缺乏信任。这一问题在很大程度上源于深度神经网络的不透明性和复杂性,以及制造商无法就其性能提供有效保证。 符合预测(Conformal Prediction)是机器学习社区中一种成熟的方法,它是一种高效且可靠的策略,可以在极少的约束条件下为任何黑盒模型的预测提供值得信赖的保证。为了弥合安全机器学习与精准农业两个社区之间的差距,本文通过基于深度学习图像分类的精准除草任务展示了符合预测的实际应用。文章首先详细介绍了符合预测方法,并基于“符合化”(conformalized)的神经网络和明确定义的喷洒决策规则开发了一个精准喷洒流程, 随后在两个真实场景中评估了这一流程:一个处于分布内条件下,另一个反映了接近分布外的情况。结果表明,我们能够为至少喷洒90%的杂草提供正式的、即可认证的保证。
https://arxiv.org/abs/2501.07185
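The guarantee described above comes from conformal prediction. A minimal split-conformal sketch for classification is shown below: calibrate a threshold on held-out nonconformity scores, then return every label whose score falls under it. The score definition (one minus the predicted probability), the sample numbers, and the class names are illustrative assumptions, and the article's actual spraying decision rules are not reproduced here.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, threshold):
    """All labels whose nonconformity score 1 - p(label) is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= threshold]

# Hypothetical calibration scores: 1 - p(true label) on 20 held-out images.
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.70,
              0.08, 0.15, 0.22, 0.28, 0.33, 0.45, 0.50, 0.60, 0.02, 0.18]
threshold = conformal_threshold(cal_scores, alpha=0.1)
labels = prediction_set({"weed": 0.55, "crop": 0.42, "soil": 0.03}, threshold)
```

Under the usual exchangeability assumption, the returned set contains the true label with probability at least 1 - alpha. A hypothetical decision rule such as "spray whenever `weed` is in the set" would then inherit a coverage guarantee on weeds.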
Unlike image classification and annotation, for which deep network models have achieved dominant performance compared to traditional computer vision algorithms, deep learning for automatic image segmentation still faces critical challenges. One such hurdle is obtaining ground-truth segmentations as the training labels for deep network training. Especially when we study biomedical images, such as histopathological images (histo-images), it is unrealistic to ask for manual segmentation labels as the ground truth for training due to the fine image resolution as well as the large image size and complexity. In this paper, instead of relying on clean segmentation labels, we study whether and how integrating imperfect or noisy segmentation results from off-the-shelf segmentation algorithms may help achieve better segmentation results through a new Adaptive Noise-Tolerant Network (ANTN) model. We extend noisy-label deep learning to image segmentation with two novel aspects: (1) multiple noisy labels can be integrated into one deep learning model; (2) noisy segmentation modeling, including probabilistic parameters, is adaptive, depending on the appearance of the given testing image. Implementation of the new ANTN model on both synthetic data and real-world histo-images demonstrates its effectiveness and superiority over off-the-shelf and other existing deep-learning-based image segmentation algorithms.
与图像分类和标注不同,深度网络模型在这些任务上已经取得了相对于传统计算机视觉算法的主导性优势,在自动图像分割领域,深度学习仍然面临重大挑战。其中一个难题是获取用于训练深度网络的真实分割标签。特别是在研究生物医学图像(如组织病理学图像)时,由于图像分辨率高、尺寸大且复杂度高,要求手动提供真实的分割标签作为训练数据是不现实的。 在本文中,我们不再依赖于清洁的分割标签,而是探讨了整合现成分割算法提供的不完美或有噪声的结果是否以及如何通过一种新的自适应容错网络(ANTN)模型来实现更好的分割效果。我们将嘈杂标签深度学习应用于图像分割,并引入两个新颖方面:(1) 可以将多个带有噪音的标签集成到一个深度学习模型中;(2) 噪声分割建模,包括概率参数,是自适应的,根据给定测试图像的表现形式而变化。 在合成数据和真实世界的组织病理学图像上实现新的ANTN模型证明了其有效性和相对于现成及现有基于深度学习的图像分割算法的优势。
https://arxiv.org/abs/2501.07163
Scaling up the vocabulary of semantic segmentation models is extremely challenging because annotating large-scale mask labels is labour-intensive and time-consuming. Recently, language-guided segmentation models have been proposed to address this challenge. However, their performance drops significantly when applied to out-of-distribution categories. In this paper, we propose a new large vocabulary semantic segmentation framework, called LarvSeg. Different from previous works, LarvSeg leverages image classification data to scale the vocabulary of semantic segmentation models as large-vocabulary classification datasets usually contain balanced categories and are much easier to obtain. However, for classification tasks, the category is image-level, while for segmentation we need to predict the label at pixel level. To address this issue, we first propose a general baseline framework to incorporate image-level supervision into the training process of a pixel-level segmentation model, making the trained network perform semantic segmentation on newly introduced categories in the classification data. We then observe that a model trained on segmentation data can group pixel features of categories beyond the training vocabulary. Inspired by this finding, we design a category-wise attentive classifier to apply supervision to the precise regions of corresponding categories to improve the model performance. Extensive experiments demonstrate that LarvSeg significantly improves the large vocabulary semantic segmentation performance, especially in the categories without mask labels. For the first time, we provide a 21K-category semantic segmentation model with the help of ImageNet21K. The code is available at this https URL.
扩展语义分割模型的词汇量是非常具有挑战性的,因为标注大规模掩码标签既费力又耗时。最近,有人提出了语言引导的分割模型来应对这一挑战,然而这些模型在应用于未见过类别(out-of-distribution categories)时性能显著下降。本文中,我们提出了一种新的大词汇语义分割框架 LarvSeg。与之前的工作不同,LarvSeg利用图像分类数据来扩展语义分割模型的词汇量,因为大型词汇分类数据集通常包含平衡的类别并且更容易获取。然而,在分类任务中,类别是图像级别的;而在分割任务中,则需要在像素级别预测标签。为解决这一问题,我们首先提出了一种通用基线框架,将图像级监督纳入像素级分割模型的训练过程中,从而使训练后的网络能够对新引入的分类数据中的类别进行语义分割。然后我们发现,在分割数据上训练的模型可以聚类超出训练词汇表类别的像素特征。受到这一发现的启发,我们设计了一种基于类别的注意力分类器,以监督对应类别的精确区域,从而提高模型性能。 大量的实验表明,LarvSeg显著提升了大词汇语义分割的表现,尤其是在没有掩码标签的类别上。借助ImageNet21K数据集的支持,首次提供了一个拥有21,000个类别的语义分割模型。代码可在此 https URL 获取。
https://arxiv.org/abs/2501.06862
In nations such as Bangladesh, agriculture plays a vital role in providing livelihoods for a significant portion of the population. Identifying and classifying plant diseases early is critical to prevent their spread and minimize their impact on crop yield and quality. Various computer vision techniques can be used for such detection and classification. While CNNs have been dominant on such image classification tasks, vision transformers have recently become equally good. In this paper, we study various computer vision techniques for Bangladeshi rice leaf disease detection. We use Dhan-Shomadhan, a Bangladeshi rice leaf disease dataset, to experiment with various CNN and ViT models. We also compare the performance of these deep neural network architectures with a traditional machine learning method, the Support Vector Machine (SVM). We leverage transfer learning for better generalization with a smaller amount of training data. Among the models tested, ResNet50 exhibited the best performance over the other CNN and transformer-based models, making it the optimal choice for this task.
在像孟加拉国这样的国家,农业对大量人口的生计起着至关重要的作用。早期识别和分类植物疾病对于防止疾病的传播以及减少其对农作物产量和质量的影响至关重要。可以使用各种计算机视觉技术来进行此类检测和分类。虽然卷积神经网络(CNNs)在这类图像分类任务中占据主导地位,但视觉变压器在近期也取得了同样出色的性能。本文研究了用于孟加拉国水稻叶片疾病检测的各种计算机视觉技术。我们使用 Dhan-Shomadhan——一个孟加拉国的水稻叶片病害数据集,来试验各种 CNN 和 ViT(Vision Transformer)模型。我们还将这些深度神经网络架构与传统的机器学习架构如支持向量机(SVM)进行了性能比较。为了在训练数据较少的情况下实现更好的泛化效果,我们利用了迁移学习技术。在测试的所有模型中,ResNet50 在其他 CNN 和基于变压器的模型之上表现最佳,使其成为该任务的最佳选择。
https://arxiv.org/abs/2501.06740
This paper presents the application of Kolmogorov-Arnold Networks (KAN) in classifying metal surface defects. Specifically, steel surfaces are analyzed to detect defects such as cracks, inclusions, patches, pitted surfaces, and scratches. Drawing on the Kolmogorov-Arnold theorem, KAN provides a novel approach compared to conventional multilayer perceptrons (MLPs), facilitating more efficient function approximation by utilizing spline functions. The results show that KAN networks can achieve better accuracy than convolutional neural networks (CNNs) with fewer parameters, resulting in faster convergence and improved performance in image classification.
本文介绍了Kolmogorov-Arnold网络(KAN)在金属表面缺陷分类中的应用,特别是用于检测钢铁表面的裂纹、夹杂、斑点、凹坑和划痕等缺陷。基于Kolmogorov-Arnold定理,KAN提供了一种与传统的多层感知器(MLP)不同的新方法,通过使用样条函数实现了更高效的函数逼近。实验结果表明,相比于卷积神经网络(CNN),KAN网络在参数较少的情况下可以达到更高的准确率,并且收敛速度更快,图像分类性能更好。
https://arxiv.org/abs/2501.06389
With the rise and ubiquity of larger deep learning models, the need for high-quality compression techniques is growing in order to deploy these models widely. The sheer parameter count of these models makes it difficult to fit them into the memory constraints of different hardware. In this work, we present a novel approach to model compression by merging similar parameter groups within a model, rather than pruning away less important parameters. Specifically, we select, align, and merge separate feed-forward sublayers in Transformer models, and test our method on language modeling, image classification, and machine translation. With our method, we demonstrate performance comparable to the original models while combining more than a third of model feed-forward sublayers, and demonstrate improved performance over a strong layer-pruning baseline. For instance, we can remove over 21% of total parameters from a Vision Transformer, while maintaining 99% of its original performance. Additionally, we observe that some groups of feed-forward sublayers exhibit high activation similarity, which may help explain their surprising mergeability.
随着大型深度学习模型的兴起和普及,为了广泛部署这些模型,对高质量压缩技术的需求日益增长。这些模型庞大的参数数量使得它们难以适应不同硬件的内存限制。在这项工作中,我们提出了一种新的模型压缩方法,通过合并模型内的相似参数组而非剪除不太重要的参数来进行压缩。具体来说,我们在Transformer模型中选择、对齐并合并单独的前馈子层,并在语言建模、图像分类和机器翻译任务上测试了我们的方法。使用我们的方法,在合并超过三分之一模型前馈子层的同时,我们展示了与原始模型相当的性能,并且超越了一个强大的层剪枝基线的表现。例如,我们可以从Vision Transformer中移除超过21%的总参数量,同时保持其原始性能的99%。此外,我们观察到一些前馈子层组表现出高激活相似性,这可能有助于解释它们令人惊讶的可合并性。
https://arxiv.org/abs/2501.06126
Although Application-Specific Integrated Circuits (ASICs) have proven to be a relevant choice for integrating inference at the edge, they are often limited in terms of applicability. In this paper, we demonstrate that an ASIC neural network accelerator dedicated to image processing can be applied to multiple tasks of different levels, namely image classification and compression, while requiring very limited hardware. The key component is a reconfigurable, mixed-precision (3b/2b/1b) encoder that takes advantage of proper weight and activation quantizations combined with convolutional layer structural pruning to lower hardware-related constraints (memory and computing). We introduce an automatic adaptation of linear symmetric quantizer scaling factors to perform quantized levels equalization, aiming at stabilizing the training of quinary and ternary weights. In addition, a proposed layer-shared Bit-Shift Normalization significantly simplifies the implementation of the hardware-expensive Batch Normalization. For a specific configuration in which the encoder design only requires 1Mb, the classification accuracy reaches 87.5% on CIFAR-10. We also show that this quantized encoder can be used to compress an image patch-by-patch while the reconstruction can be performed remotely by a dedicated full-frame decoder. This solution typically enables end-to-end compression almost without any block artifacts, outperforming patch-based state-of-the-art techniques employing a patch-constant bitrate.
即使专用集成电路(ASIC)已经被证明是边缘设备集成推理任务的一个相关选择,它们在适用性方面通常受到限制。在这篇论文中,我们展示了专门为图像处理设计的ASIC神经网络加速器可以应用于多个不同层次的任务:如图像分类和压缩,并且只需要非常有限的硬件资源。 关键组件是一个可重构、混合精度(3b/2b/1b)的编码器,它利用适当的权重和激活量化与卷积层结构剪枝相结合的方法来降低硬件相关的限制(内存和计算)。我们引入了一个自动调整线性对称量化因子以执行量化级别均衡的技术,旨在稳定五元组和三元组权重训练。此外,提出的共享Bit-Shift归一化方法大大简化了硬件成本高昂的批量归一化的实现。 对于一个特定配置,在编码器设计仅需1Mb的情况下,CIFAR-10数据集上的分类准确率达到87.5%。另外,我们还展示了这种量化编码器可以逐块压缩图像,并且可以在远程通过专门的全帧解码器进行重建。此方案通常能够实现几乎无任何块状伪影的端到端压缩效果,优于使用固定比特率的区块技术的方法。
https://arxiv.org/abs/2501.05097
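As a rough picture of what a linear symmetric quantizer with ternary or quinary levels does, the sketch below maps each weight to the nearest of a small set of symmetric integer levels and rescales it. This is a textbook-style illustration. The fixed `scale` argument and the function name are assumptions, and the paper's automatic adaptation of scaling factors is not reproduced.

```python
def quantize_symmetric(weights, scale, levels):
    """Map each weight to the nearest of `levels` symmetric integer levels
    (3 = ternary, 5 = quinary), then multiply back by the scale."""
    half = (levels - 1) // 2  # 1 for ternary, 2 for quinary
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-half, min(half, q))  # clip to [-half, half]
        out.append(q * scale)
    return out

weights = [0.9, -0.2, 0.05, -1.7]
ternary = quantize_symmetric(weights, scale=0.5, levels=3)
quinary = quantize_symmetric(weights, scale=0.5, levels=5)
```

With a poorly chosen scale, most weights collapse onto one level. Equalizing how often each level is used is presumably what the "quantized levels equalization" in the abstract aims for, though the abstract does not describe the procedure.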
In the medical field, accurate diagnosis of lung cancer is crucial for treatment. Traditional manual analysis methods have significant limitations in terms of accuracy and efficiency. To address this issue, this paper proposes a deep learning network framework based on the pre-trained MobileNetV2 model, initialized with weights from the ImageNet-1K dataset (version 2). The last layer of the model (the fully connected layer) is replaced with a new fully connected layer, and a softmax activation function is added to efficiently classify three types of lung cancer CT scan images. Experimental results show that the model achieves an accuracy of 99.6% on the test set, with significant improvements in feature extraction compared to traditional methods. With the rapid development of artificial intelligence technologies, deep learning applications in medical image processing are bringing revolutionary changes to the healthcare industry. AI-based lung cancer detection systems can significantly improve diagnostic efficiency, reduce the workload of doctors, and occupy an important position in the global healthcare market. The potential of AI to improve diagnostic accuracy, reduce medical costs, and promote precision medicine will have a profound impact on the future development of the healthcare industry.
在医学领域,准确诊断肺癌对于治疗至关重要。传统的手动分析方法在准确性与效率方面存在显著局限性。为了解决这一问题,本文提出了一种基于预训练的MobileNetV2模型(从ImageNet-1K数据集版本2中初始化权重)的深度学习网络框架。模型的最后一层(全连接层)被替换为一个新的全连接层,并添加了softmax激活函数以高效地分类三种类型的肺癌CT扫描图像。实验结果显示,该模型在测试集中达到了99.6%的准确率,在特征提取方面相比传统方法有显著改进。 随着人工智能技术的快速发展,深度学习在医学影像处理中的应用正在为医疗行业带来革命性的变化。基于AI的肺癌检测系统可以大幅提高诊断效率、减轻医生的工作负担,并在全球医疗卫生市场中占据重要地位。人工智能在提升诊断准确性、降低医疗成本和促进精准医疗方面的潜力将对未来医疗行业的未来发展产生深远影响。
https://arxiv.org/abs/2501.04996