This paper reveals that every image can be understood as a first-order norm+linear autoregressive process, referred to as FINOLA, where norm+linear denotes the use of normalization before the linear model. We demonstrate that images of size 256$\times$256 can be reconstructed from a compressed vector using autoregression up to a 16$\times$16 feature map, followed by upsampling and convolution. This discovery sheds light on the underlying partial differential equations (PDEs) governing the latent feature space. Additionally, we investigate the application of FINOLA for self-supervised learning through a simple masked prediction technique. By encoding a single unmasked quadrant block, we can autoregressively predict the surrounding masked region. Remarkably, this pre-trained representation proves effective for image classification and object detection tasks, even in lightweight networks, without requiring fine-tuning. The code will be made publicly available.
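To make the recurrence concrete, here is a minimal PyTorch sketch of a first-order norm+linear rollout over a feature map. The transition matrices A and B, the seed position, and the normalization details are illustrative assumptions, not the authors' exact formulation:

```python
import torch

C, H, W = 512, 16, 16
A = torch.randn(C, C) * 0.01   # assumed horizontal transition (learned in practice)
B = torch.randn(C, C) * 0.01   # assumed vertical transition (learned in practice)

def norm(x):
    # per-position channel normalization (LayerNorm-style, no affine)
    return (x - x.mean()) / (x.std() + 1e-6)

def finola_rollout(q):
    """Unroll one compressed vector q of shape (C,) into a (C, H, W) feature map."""
    feat = torch.zeros(C, H, W)
    feat[:, 0, 0] = q                          # seed position is an assumption
    for j in range(1, W):                      # first row, left to right
        prev = feat[:, 0, j - 1]
        feat[:, 0, j] = prev + A @ norm(prev)
    for i in range(1, H):                      # each column, top to bottom
        for j in range(W):
            prev = feat[:, i - 1, j]
            feat[:, i, j] = prev + B @ norm(prev)
    return feat  # upsampling + convolution would then decode 256x256 pixels

print(finola_rollout(torch.randn(512)).shape)  # torch.Size([512, 16, 16])
```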
https://arxiv.org/abs/2305.16319
For computer vision tasks, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts in the input image. To address this, we introduce novel designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve truly shift-equivariant ViTs on four well-established models, namely, Swin, SwinV2, MViTv2, and CvT, both in theory and practice. Empirically, we tested these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.
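The 100% shift-consistency claim suggests a simple check: a model's prediction should never change under a cyclic shift of the input. A minimal sketch of such a metric, assuming an nn.Module classifier on (N, 3, H, W) batches:

```python
import torch

def shift_consistency(model, images, max_shift=8):
    """Fraction of predictions unchanged under a random cyclic shift."""
    model.eval()
    with torch.no_grad():
        base = model(images).argmax(dim=1)
        dx = torch.randint(1, max_shift + 1, (1,)).item()
        dy = torch.randint(1, max_shift + 1, (1,)).item()
        shifted = torch.roll(images, shifts=(dy, dx), dims=(2, 3))
        moved = model(shifted).argmax(dim=1)
    return (base == moved).float().mean().item()

# e.g. print(shift_consistency(my_vit, val_batch))  # 1.0 for a truly
# shift-equivariant classifier (my_vit and val_batch are placeholders)
```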
https://arxiv.org/abs/2305.16316
We propose SING (StabIlized and Normalized Gradient), a plug-and-play technique that improves the stability and generalization of the Adam(W) optimizer. SING is straightforward to implement and has minimal computational overhead, requiring only a layer-wise standardization of the gradients fed to Adam(W) without introducing additional hyper-parameters. We support the effectiveness and practicality of the proposed approach by showing improved results on a wide range of architectures, problems (such as image classification, depth estimation, and natural language processing), and in combination with other optimizers. We provide a theoretical analysis of the convergence of the method, and we show that by virtue of the standardization, SING can escape local minima narrower than a threshold that is inversely proportional to the network's depth.
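Since SING amounts to a layer-wise standardization of the gradients before the Adam(W) step, it can be sketched in a few lines of PyTorch. Standardizing each parameter tensor independently and the epsilon value are assumptions for illustration:

```python
def standardize_gradients(model, eps=1e-8):
    """Standardize each layer's gradient to zero mean and unit variance."""
    for p in model.parameters():
        if p.grad is not None and p.grad.numel() > 1:
            g = p.grad
            p.grad = (g - g.mean()) / (g.std() + eps)

# plug-and-play around an ordinary AdamW step:
# loss.backward()
# standardize_gradients(model)   # the only change SING requires
# optimizer.step()
# optimizer.zero_grad()
```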
https://arxiv.org/abs/2305.15997
In knowledge distillation, the teacher is generally much larger than the student, so the teacher's solution is likely to be difficult for the student to learn. To ease the mimicking difficulty, we introduce a triplet knowledge distillation mechanism named TriKD. Besides the teacher and student, TriKD employs a third role called the anchor model. Before distillation begins, the pre-trained anchor model delimits a subspace within the full solution space of the target problem. Solutions within this subspace are expected to be easy targets that the student can mimic well. Distillation then proceeds in an online manner, and the teacher is only allowed to express solutions within the aforementioned subspace. Surprisingly, benefiting from accurate but easy-to-mimic hints, the student can finally perform well. After the student is well trained, it can be used as the new anchor for new students, forming a curriculum learning strategy. Our experiments on image classification and face recognition with various models clearly demonstrate the effectiveness of our method. Furthermore, the proposed TriKD is also effective in dealing with overfitting. Moreover, our theoretical analysis supports the rationality of our triplet distillation.
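A hedged sketch of how the three roles might interact in a training step: the student mimics the teacher, while the online teacher is pulled toward the frozen anchor's subspace. The KL-divergence form, temperature, and loss composition are illustrative assumptions, not the paper's exact objective:

```python
import torch.nn.functional as F

def trikd_losses(student_logits, teacher_logits, anchor_logits, labels, T=4.0):
    task_s = F.cross_entropy(student_logits, labels)
    task_t = F.cross_entropy(teacher_logits, labels)
    # student mimics the (accurate but easy-to-mimic) online teacher
    mimic = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                     F.softmax(teacher_logits / T, dim=1),
                     reduction="batchmean") * T * T
    # teacher is kept close to the frozen anchor's solution subspace
    restrict = F.kl_div(F.log_softmax(teacher_logits / T, dim=1),
                        F.softmax(anchor_logits.detach() / T, dim=1),
                        reduction="batchmean") * T * T
    return task_s + mimic, task_t + restrict   # student loss, teacher loss
```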
https://arxiv.org/abs/2305.15975
Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2\% for zero-shot classification on OBJ\_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6\% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
https://arxiv.org/abs/2305.15957
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we argue that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and we propose a novel KD method dubbed DiffKD that explicitly denoises and matches features using diffusion models. Our approach is based on the observation that student features typically contain more noise than teacher features due to the smaller capacity of the student model. To address this, we propose to denoise student features using a diffusion model trained on teacher features, which allows us to perform better distillation between the refined, clean feature and the teacher feature. Additionally, we introduce a lightweight diffusion model with a linear autoencoder to reduce the computational cost, and an adaptive noise-matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and consistently achieves state-of-the-art performance on image classification, object detection, and semantic segmentation tasks. Code will be available at this https URL.
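The core idea, treating the student feature as a noisy version of a clean, teacher-like feature, can be sketched as follows. The tiny convolutional denoiser, the single noise-prediction objective, and the fixed refinement schedule are simplifications for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Denoiser(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                 nn.ReLU(),
                                 nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, x):
        return self.net(x)          # predicts the noise component

def diffkd_step(denoiser, student_feat, teacher_feat, num_steps=4):
    # train the denoiser to predict noise added to teacher features
    noise = torch.randn_like(teacher_feat)
    diff_loss = F.mse_loss(denoiser(teacher_feat + noise), noise)
    # iteratively strip the estimated noise from the student feature
    refined = student_feat
    for _ in range(num_steps):
        refined = refined - denoiser(refined) / num_steps
    kd_loss = F.mse_loss(refined, teacher_feat.detach())
    return diff_loss + kd_loss
```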
https://arxiv.org/abs/2305.15712
As the adoption of AI systems in clinical settings grows, limitations in bandwidth could create communication bottlenecks when streaming imaging data, leading to delays in patient diagnosis and treatment. As such, healthcare providers and AI vendors will require greater computational infrastructure, dramatically increasing costs. To that end, we developed intelligent streaming, a state-of-the-art framework to enable accelerated, cost-effective, bandwidth-optimized, and computationally efficient AI inference for clinical decision making at scale. For classification, intelligent streaming reduced data transmission by 99.01% and decoding time by 98.58%, while increasing throughput by 27.43x. For segmentation, our framework reduced data transmission by 90.32% and decoding time by 90.26%, while increasing throughput by 4.20x. Our work demonstrates that intelligent streaming results in faster turnaround times and a reduced overall cost of data and transmission, without negatively impacting clinical decision making using AI systems.
https://arxiv.org/abs/2305.15617
Acquiring high-quality data for training discriminative models is a crucial yet challenging aspect of building effective predictive systems. In this paper, we present Diffusion Inversion, a simple yet effective method that leverages the pre-trained generative model, Stable Diffusion, to generate diverse, high-quality training data for image classification. Our approach captures the original data distribution and ensures data coverage by inverting images to the latent space of Stable Diffusion, and generates diverse novel training images by conditioning the generative model on noisy versions of these vectors. We identify three key components that allow our generated images to successfully supplant the original dataset, leading to a 2-3x enhancement in sample complexity and a 6.5x decrease in sampling time. Moreover, our approach consistently outperforms generic prompt-based steering methods and KNN retrieval baseline across a wide range of datasets. Additionally, we demonstrate the compatibility of our approach with widely-used data augmentation techniques, as well as the reliability of the generated data in supporting various neural architectures and enhancing few-shot learning.
https://arxiv.org/abs/2305.15316
Despite their impressive performance in classification, neural networks are known to be vulnerable to adversarial attacks. These attacks are small perturbations of the input data designed to fool the model. Naturally, a question arises regarding the potential connection between the architecture, settings, or properties of the model and the nature of the attack. In this work, we aim to shed light on this problem by focusing on the implicit bias of the neural network, which refers to its inherent inclination to favor specific patterns or outcomes. Specifically, we investigate one aspect of the implicit bias, which involves the essential Fourier frequencies required for accurate image classification. We conduct tests to assess the statistical relationship between these frequencies and those necessary for a successful attack. To delve into this relationship, we propose a new method that can uncover non-linear correlations between sets of coordinates, which, in our case, are the aforementioned frequencies. By exploiting the entanglement between intrinsic dimension and correlation, we provide empirical evidence that the network bias in Fourier space and the target frequencies of adversarial attacks are closely tied.
https://arxiv.org/abs/2305.15203
This paper introduces a novel attention mechanism, called dual attention, which is both efficient and effective. The dual attention mechanism consists of two parallel components: local attention generated by Convolutional Neural Networks (CNNs) and long-range attention generated by Vision Transformers (ViTs). To address the high computational complexity and memory footprint of vanilla Multi-Head Self-Attention (MHSA), we introduce a novel Multi-Head Partition-wise Attention (MHPA) mechanism. The partition-wise attention approach models both intra-partition and inter-partition attention simultaneously. Building on the dual attention block and partition-wise attention mechanism, we present a hierarchical vision backbone called DualFormer. We evaluate the effectiveness of our model on several computer vision tasks, including image classification on ImageNet, object detection on COCO, and semantic segmentation on Cityscapes. Specifically, the proposed DualFormer-XS achieves 81.5\% top-1 accuracy on ImageNet, outperforming the recent state-of-the-art MPViT-XS by 0.6\% top-1 accuracy with much higher throughput.
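A hedged sketch of partition-wise attention: tokens attend within their own partition (intra), and per-partition summaries attend to each other (inter). Equal-size 1D partitions, mean-pooled summaries, and the use of nn.MultiheadAttention are illustrative assumptions, not the exact MHPA design:

```python
import torch
import torch.nn as nn

class PartitionAttention(nn.Module):
    def __init__(self, dim, heads=4, part=4):
        super().__init__()
        self.part = part
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, C), N divisible by part
        B, N, C = x.shape
        p = self.part
        w = x.reshape(B * p, N // p, C)        # tokens grouped into p partitions
        w, _ = self.intra(w, w, w)             # intra-partition attention
        w = w.reshape(B, p, N // p, C)
        s = w.mean(dim=2)                      # one summary token per partition
        s, _ = self.inter(s, s, s)             # inter-partition attention
        return (w + s.unsqueeze(2)).reshape(B, N, C)

x = torch.randn(2, 64, 96)
print(PartitionAttention(96)(x).shape)  # torch.Size([2, 64, 96])
```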
https://arxiv.org/abs/2305.14768
We attempt to estimate the spatial distribution of the heterogeneous physical parameters involved in the formation of magnetic domain patterns of polycrystalline thin films by using convolutional neural networks. We propose a method to obtain a spatial map of physical parameters by estimating the parameters from patterns within a small subregion window of the full magnetic domain and subsequently shifting this window. To enhance the accuracy of parameter estimation in such subregions, we employ large-scale models used for natural image classification and exploit the benefits of pretraining. Using a model with high estimation accuracy on these subregions, we conduct inference on simulation data featuring spatially varying parameters and demonstrate the capability to detect such parameter variations.
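The window-shifting procedure can be sketched directly. The window size, stride, and the output shape of the regression model are illustrative choices:

```python
import torch

def parameter_map(model, image, window=64, stride=32):
    """image: (1, 1, H, W) domain pattern -> (rows, cols, n_params) estimates."""
    _, _, H, W = image.shape
    rows = []
    with torch.no_grad():
        for top in range(0, H - window + 1, stride):
            row = []
            for left in range(0, W - window + 1, stride):
                crop = image[:, :, top:top + window, left:left + window]
                row.append(model(crop).squeeze(0))  # per-window parameter estimate
            rows.append(torch.stack(row))
    return torch.stack(rows)
```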
https://arxiv.org/abs/2305.14764
Unsupervised learning has grown in popularity because of the difficulty of collecting annotated data and the development of modern frameworks that allow us to learn from unlabeled data. Existing studies, however, either disregard variations at different levels of similarity or only consider negative samples from one batch. We argue that image pairs should have varying degrees of similarity, and the negative samples should be allowed to be drawn from the entire dataset. In this work, we propose Search-based Unsupervised Visual Representation Learning (SUVR) to learn better image representations in an unsupervised manner. We first construct a graph from the image dataset by the similarity between images, and adopt the concept of graph traversal to explore positive samples. In the meantime, we make sure that negative samples can be drawn from the full dataset. Quantitative experiments on five benchmark image classification datasets demonstrate that SUVR can significantly outperform strong competing methods on unsupervised embedding learning. Qualitative experiments also show that SUVR can produce better representations in which similar images are clustered closer together than unrelated images in the latent space.
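A hedged sketch of the sampling scheme: positives are gathered by traversing a k-NN similarity graph (deeper hops yielding lower similarity), while negatives are drawn uniformly from the full dataset. Cosine similarity, k, and the traversal depth are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def knn_graph(embeddings, k=5):
    """Build a k-NN graph from cosine similarities; returns (N, k) neighbor ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t()
    sim.fill_diagonal_(-1.0)        # exclude self-matches
    return sim.topk(k, dim=1).indices

def sample_pairs(graph, anchor, depth=2, n_neg=8):
    positives, frontier = [], [anchor]
    for _ in range(depth):          # deeper hops -> less similar positives
        frontier = [n.item() for f in frontier for n in graph[f]]
        positives.extend(frontier)
    negatives = torch.randint(0, graph.shape[0], (n_neg,)).tolist()  # full dataset
    return positives, negatives

graph = knn_graph(torch.randn(100, 32))
pos, neg = sample_pairs(graph, anchor=0)
```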
https://arxiv.org/abs/2305.14754
In real-world applications, it is essential to jointly estimate the 3D object pose and class label of objects, i.e., to perform 3D-aware classification. While current approaches for either image classification or pose estimation can be extended to 3D-aware classification, we observe that they are inherently limited: 1) Their performance is much lower compared to the respective single-task models, and 2) they are not robust in out-of-distribution (OOD) scenarios. Our main contribution is a novel architecture for 3D-aware classification, which builds upon a recent work and performs comparably to single-task models while being highly robust. In our method, an object category is represented as a 3D cuboid mesh composed of feature vectors at each mesh vertex. Using differentiable rendering, we estimate the 3D object pose by minimizing the reconstruction error between the mesh and the feature representation of the target image. Object classification is then performed by comparing the reconstruction losses across object categories. Notably, the neural texture of the mesh is trained in a discriminative manner to enhance the classification performance while also avoiding local optima in the reconstruction loss. Furthermore, we show how our method and feed-forward neural networks can be combined to scale the render-and-compare approach to larger numbers of categories. Our experiments on PASCAL3D+, occluded-PASCAL3D+, and OOD-CV show that our method outperforms all baselines at 3D-aware classification by a wide margin in terms of performance and robustness.
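The render-and-compare classification rule can be sketched as follows: fit a pose per category by gradient descent on the reconstruction error, then pick the category with the lowest loss. The render callable stands in for the differentiable renderer, and the 6-DoF pose parameterization is an assumption:

```python
import torch
import torch.nn.functional as F

def classify_render_and_compare(feature_map, meshes, render, n_steps=50, lr=0.05):
    """Pick the category whose mesh best reconstructs the target feature map.

    render(mesh, pose) -> feature map; a caller-supplied stand-in for the
    differentiable renderer.
    """
    losses = []
    for mesh in meshes:
        pose = torch.zeros(6, requires_grad=True)   # assumed 6-DoF pose
        opt = torch.optim.Adam([pose], lr=lr)
        for _ in range(n_steps):                    # fit the pose for this category
            loss = F.mse_loss(render(mesh, pose), feature_map)
            opt.zero_grad()
            loss.backward()
            opt.step()
        losses.append(loss.item())
    return min(range(len(meshes)), key=lambda c: losses[c])
```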
https://arxiv.org/abs/2305.14668
The topic of achieving rotational invariance in convolutional neural networks (CNNs) has gained considerable attention recently, as this invariance is crucial for many computer vision tasks such as image classification and matching. In this letter, we propose a Sorting Convolution (SC) inspired by some hand-crafted features of texture images, which achieves continuous rotational invariance without requiring additional learnable parameters or data augmentation. Further, SC can directly replace the conventional convolution operations in a classic CNN model to achieve its rotational invariance. Based on MNIST-rot dataset, we first analyze the impact of convolutional kernel sizes, different sampling and sorting strategies on SC's rotational invariance, and compare our method with previous rotation-invariant CNN models. Then, we combine SC with VGG, ResNet and DenseNet, and conduct classification experiments on popular texture and remote sensing image datasets. Our results demonstrate that SC achieves the best performance in the aforementioned tasks.
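A hedged sketch of the sorting idea: values in each local window are sorted before the linear combination, so the response no longer depends on their angular order. Sorting the whole 3x3 window, rather than per sampling ring, is a simplification for illustration:

```python
import torch
import torch.nn.functional as F

def sorting_conv(x, weight):
    """x: (B, 1, H, W); weight: (C_out, 9) applied to sorted window values."""
    patches = F.unfold(x, kernel_size=3, padding=1)   # (B, 9, H*W)
    patches, _ = patches.sort(dim=1)                  # sort within each window
    out = torch.einsum("bkn,ck->bcn", patches, weight)
    B, _, H, W = x.shape
    return out.reshape(B, -1, H, W)

x = torch.randn(2, 1, 32, 32)
w = torch.randn(8, 9)
print(sorting_conv(x, w).shape)  # torch.Size([2, 8, 32, 32])
```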
https://arxiv.org/abs/2305.14462
Deep neural networks (DNNs) have made remarkable strides in various computer vision tasks, including image classification, segmentation, and object detection. However, recent research has revealed a vulnerability in advanced DNNs when faced with deliberate manipulations of input data, known as adversarial attacks. Moreover, the accuracy of DNNs is heavily influenced by the distribution of the training dataset. Distortions or perturbations in the color space of input images can introduce out-of-distribution data, resulting in misclassification. In this work, we propose a brightness-variation dataset, which incorporates 24 distinct brightness levels for each image within a subset of ImageNet. This dataset enables us to simulate the effects of light and shadow on the images, so as to investigate their impact on the performance of DNNs. In our study, we conduct experiments using several state-of-the-art DNN architectures on the aforementioned dataset. Through our analysis, we discover a noteworthy positive correlation between the brightness levels and the loss of accuracy in DNNs. Furthermore, we assess the effectiveness of recently proposed robust training techniques and strategies, including AugMix, Revisit, and Free Normalizer, using the ResNet50 architecture on our brightness-variation dataset. Our experimental results demonstrate that these techniques can enhance the robustness of DNNs against brightness variation, leading to improved performance when dealing with images exhibiting varying brightness levels.
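Producing that kind of brightness-variation data is straightforward to sketch; the use of torchvision's adjust_brightness and the factor range are assumptions, since the abstract does not specify how the 24 levels are generated:

```python
import torch
from torchvision.transforms.functional import adjust_brightness

def brightness_variants(img, n_levels=24, low=0.2, high=2.0):
    """img: (3, H, W) in [0, 1] -> n_levels brightness-adjusted copies."""
    factors = torch.linspace(low, high, n_levels)
    return [adjust_brightness(img, f.item()) for f in factors]

variants = brightness_variants(torch.rand(3, 224, 224))
print(len(variants))  # 24
```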
https://arxiv.org/abs/2305.14165
Vector-Borne Disease (VBD) is an infectious disease transmitted through the pathogenic female Aedes mosquito to humans and animals. It is important to control dengue disease by reducing the spread of Aedes mosquito vectors. Community awareness plays a crucial role in ensuring Aedes control programmes and encourages communities to take an active part. Identifying the species of mosquito helps gauge the mosquito density in a locality and intensify mosquito control efforts in particular areas. This will help in avoiding Aedes breeding sites around residential areas and reducing adult mosquitoes. To serve this purpose, an Android application is developed to identify Aedes species, helping the community contribute to mosquito control events. Several Android applications have been developed to identify species such as birds, plant species, and Anopheles mosquito species. In this work, a user-friendly mobile application, mAedesID, is developed for identifying Aedes mosquito species using a deep learning Convolutional Neural Network (CNN) algorithm, which is well suited to species image classification and achieves better accuracy on voluminous images. The mobile application can be downloaded from the URL https://tinyurl.com/mAedesID.
https://arxiv.org/abs/2305.07664
Continual Learning (CL) aims at incrementally learning new tasks without forgetting the knowledge acquired from old ones. Experience Replay (ER) is a simple and effective rehearsal-based strategy, which optimizes the model with current training data and a subset of old samples stored in a memory buffer. To further reduce forgetting, recent approaches extend ER with various techniques, such as model regularization and memory sampling. However, the prediction consistency between the new model and the old one on current training data has been seldom explored, resulting in less knowledge preserved when few previous samples are available. To address this issue, we propose a CL method with Strong Experience Replay (SER), which additionally utilizes future experiences mimicked on the current training data, besides distilling past experience from the memory buffer. In our method, the updated model will produce approximate outputs as its original ones, which can effectively preserve the acquired knowledge. Experimental results on multiple image classification datasets show that our SER method surpasses the state-of-the-art methods by a noticeable margin.
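A hedged sketch of the two replay signals: standard distillation of past experience on buffered samples, plus the consistency term that keeps the updated model's outputs on current data close to the previous model's. MSE on logits and the loss weights are illustrative choices, not the paper's exact objective:

```python
import torch.nn.functional as F

def ser_loss(model, old_model, cur_x, cur_y, buf_x, buf_logits,
             alpha=1.0, beta=1.0):
    logits = model(cur_x)
    task = F.cross_entropy(logits, cur_y)
    past = F.mse_loss(model(buf_x), buf_logits)             # distill buffered experience
    future = F.mse_loss(logits, old_model(cur_x).detach())  # consistency on current data
    return task + alpha * past + beta * future

# old_model is a frozen snapshot, e.g. copy.deepcopy(model).eval(),
# taken before training on the new task.
```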
https://arxiv.org/abs/2305.13622
Monitoring plant health is crucial for maintaining agricultural productivity and food safety. Disruptions in the plant's normal state, caused by diseases, often interfere with essential plant activities, and timely detection of these diseases can significantly mitigate crop loss. In this study, we propose a deep learning-based approach for efficient detection of plant diseases using drone-captured imagery. A comprehensive database of various plant species, exhibiting numerous diseases, was compiled from the Internet and utilized as the training and test dataset. A Convolutional Neural Network (CNN), renowned for its performance in image classification tasks, was employed as our primary predictive model. The CNN model, trained on this rich dataset, demonstrated superior proficiency in crop disease categorization and detection, even under challenging imaging conditions. For field implementation, we deployed a prototype drone model equipped with a high-resolution camera for live monitoring of extensive agricultural fields. The captured images served as the input for our trained model, enabling real-time identification of healthy and diseased plants. Our approach promises an efficient and scalable solution for improving crop health monitoring systems.
https://arxiv.org/abs/2305.13490
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, while also costing less than half as much at inference. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path for more informed scaling.
https://arxiv.org/abs/2305.13035
Memristor-based neural networks provide an exceptional energy-efficient platform for artificial intelligence (AI), presenting the possibility of self-powered operation when paired with energy harvesters. However, most memristor-based networks rely on analog in-memory computing, necessitating a stable and precise power supply, which is incompatible with the inherently unstable and unreliable energy harvesters. In this work, we fabricated a robust binarized neural network comprising 32,768 memristors, powered by a miniature wide-bandgap solar cell optimized for edge applications. Our circuit employs a resilient digital near-memory computing approach, featuring complementarily programmed memristors and logic-in-sense-amplifier. This design eliminates the need for compensation or calibration, operating effectively under diverse conditions. Under high illumination, the circuit achieves inference performance comparable to that of a lab bench power supply. In low illumination scenarios, it remains functional with slightly reduced accuracy, seamlessly transitioning to an approximate computing mode. Through image classification neural network simulations, we demonstrate that misclassified images under low illumination are primarily difficult-to-classify cases. Our approach lays the groundwork for self-powered AI and the creation of intelligent sensors for various applications in health, safety, and environment monitoring.
https://arxiv.org/abs/2305.12875