We address the challenge of training a large supernet for the object detection task with a relatively small amount of training data. Specifically, we propose an efficient supernet-based neural architecture search (NAS) method that uses transfer learning and search space pruning. First, the supernet is pre-trained on a classification task, for which large datasets are available. Second, the search space defined by the supernet is pruned by removing candidate models that are predicted to perform poorly. To effectively remove candidates over a wide range of resource constraints, we design a performance predictor, called a path filter, that accurately predicts the relative performance of models satisfying similar resource constraints, so a single path filter handles predictions for paths with different resource budgets. Supernet training is thus focused on the best-performing candidates. Compared to Once-for-All (OFA), our proposed method reduces the computational cost of finding the optimal network architecture by 30% and 63%, while yielding a better accuracy vs. floating-point operations (FLOPs) Pareto front (0.85 and 0.45 points of improvement in average precision on Pascal VOC and COCO, respectively).
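A minimal sketch of the search-space pruning step described above, assuming a toy FLOPs-bucketing rule and a placeholder in place of the trained path filter (names such as `predict_relative_score` are illustrative, not the paper's implementation):

```python
# Prune a supernet search space by keeping only the top-ranked candidate paths
# within each resource (FLOPs) bucket, so paths are only compared against peers
# with a similar resource budget.
import random

def predict_relative_score(path):
    # Placeholder predictor: in practice this would be the trained path filter
    # that ranks paths sharing a similar resource budget.
    return random.random()

def prune_search_space(paths, num_buckets=5, keep_ratio=0.5):
    flops = [p["flops"] for p in paths]
    lo, hi = min(flops), max(flops)
    width = (hi - lo) / num_buckets or 1.0
    buckets = {}
    for p in paths:
        b = min(int((p["flops"] - lo) / width), num_buckets - 1)
        buckets.setdefault(b, []).append(p)
    kept = []
    for members in buckets.values():
        members.sort(key=predict_relative_score, reverse=True)
        kept.extend(members[: max(1, int(len(members) * keep_ratio))])
    return kept

candidates = [{"id": i, "flops": random.uniform(0.5, 5.0)} for i in range(1000)]
pruned = prune_search_space(candidates)
print(f"kept {len(pruned)} of {len(candidates)} candidate paths")
```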
https://arxiv.org/abs/2303.13121
Quantum Architecture Search (QAS) is a promising approach to designing quantum circuits for variational quantum algorithms (VQAs). However, existing QAS algorithms require evaluating a large number of quantum circuits during the search process, which makes them computationally demanding and limits their application to large-scale quantum circuits. Recently, predictor-based QAS has been proposed to alleviate this problem by directly estimating the performance of circuits from their structures, using a predictor trained on a set of labeled quantum circuits. However, the predictor is trained by purely supervised learning, which suffers from poor generalization when labeled training circuits are scarce. Obtaining a large number of labeled quantum circuits is very time-consuming because the gate parameters of each circuit must be optimized until convergence to obtain its ground-truth performance. To overcome these limitations, we propose GSQAS, a graph self-supervised QAS, which trains a predictor based on self-supervised learning. Specifically, we first pre-train a graph encoder on a large number of unlabeled quantum circuits using a well-designed pretext task in order to generate meaningful representations of circuits. The downstream predictor is then trained on a small number of circuit representations and their labels. Once the encoder is trained, it can be applied to different downstream tasks. To better encode spatial topology information and avoid prohibitively high-dimensional feature vectors for large-scale quantum circuits, we design a scheme to encode quantum circuits as graphs. Simulation results on searching circuit structures for the variational quantum eigensolver and quantum state classification show that GSQAS outperforms the state-of-the-art predictor-based QAS, achieving better performance with fewer labeled circuits.
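A rough sketch of what encoding a quantum circuit as a graph for such a predictor could look like; the gate vocabulary, node features, and wire-order edges below are assumptions, not GSQAS's exact scheme:

```python
# Encode a circuit as (node features, edge list): one node per gate, edges
# connecting consecutive gates that act on the same qubit.
import numpy as np

GATE_TYPES = ["RX", "RY", "RZ", "CNOT"]  # assumed gate vocabulary

def circuit_to_graph(gates, num_qubits):
    """gates: list of (gate_name, [qubit indices]) in execution order."""
    n = len(gates)
    # Node features: one-hot gate type concatenated with a qubit-occupancy mask.
    x = np.zeros((n, len(GATE_TYPES) + num_qubits))
    for i, (name, qubits) in enumerate(gates):
        x[i, GATE_TYPES.index(name)] = 1.0
        for q in qubits:
            x[i, len(GATE_TYPES) + q] = 1.0
    # Edges: connect consecutive gates acting on the same qubit (wire order).
    edges, last_on_qubit = [], {}
    for i, (_, qubits) in enumerate(gates):
        for q in qubits:
            if q in last_on_qubit:
                edges.append((last_on_qubit[q], i))
            last_on_qubit[q] = i
    return x, np.array(edges, dtype=int)

circuit = [("RY", [0]), ("RY", [1]), ("CNOT", [0, 1]), ("RZ", [1])]
features, edge_index = circuit_to_graph(circuit, num_qubits=2)
print(features.shape, edge_index.tolist())
```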
https://arxiv.org/abs/2303.12381
Social ambiance describes the context in which social interactions happen, and can be measured using speech audio by counting the number of concurrent speakers. This measurement has enabled various mental health tracking and human-centric IoT applications. While on-device Social Ambiance Measure (SAM) is highly desirable to ensure user privacy and thus facilitate wide adoption of the aforementioned applications, the computational complexity required by state-of-the-art deep neural network (DNN) powered SAM solutions stands at odds with the often constrained resources on mobile devices. Furthermore, only limited labeled data is available or practical for SAM under clinical settings due to various privacy constraints and the required human effort, further challenging the achievable accuracy of on-device SAM solutions. To this end, we propose a dedicated neural architecture search framework for Energy-efficient and Real-time SAM (ERSAM). Specifically, our ERSAM framework can automatically search for DNNs that push forward the achievable accuracy vs. hardware efficiency frontier of mobile SAM solutions. For example, ERSAM-delivered DNNs consume only 40 mW x 12 h of energy and 0.05 seconds of processing latency for a 5-second audio segment on a Pixel 3 phone, while achieving an error rate of only 14.3% on a social ambiance dataset generated from LibriSpeech. We expect our ERSAM framework to pave the way for ubiquitous on-device SAM solutions, which are in growing demand.
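The accuracy-vs-efficiency frontier mentioned above is a Pareto front over candidate DNNs; a small sketch with made-up error/energy numbers (the candidate list and metric names are illustrative):

```python
# Keep only candidates not dominated by another candidate in both error rate
# and energy; lower is better for both objectives.
def pareto_front(candidates):
    front = []
    for c in candidates:
        dominated = any(
            o["error"] <= c["error"] and o["energy_mw"] <= c["energy_mw"]
            and (o["error"] < c["error"] or o["energy_mw"] < c["energy_mw"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["energy_mw"])

models = [
    {"name": "a", "error": 0.20, "energy_mw": 25},
    {"name": "b", "error": 0.143, "energy_mw": 40},
    {"name": "c", "error": 0.18, "energy_mw": 60},  # dominated by "b"
]
print([m["name"] for m in pareto_front(models)])
```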
https://arxiv.org/abs/2303.10727
Acoustic Event Classification (AEC) has been widely used in devices such as smart speakers and mobile phones for home safety or accessibility support. As AEC models run on more and more devices with diverse computation resource constraints, it becomes increasingly expensive to develop models that are tuned to achieve an optimal accuracy/computation trade-off for each given computation resource constraint. In this paper, we introduce a Once-For-All (OFA) Neural Architecture Search (NAS) framework for AEC. Specifically, we first train a weight-sharing supernet that supports different model architectures, and then automatically search for a model given specific computational resource constraints. Our experimental results show that, by training just once, the resulting model from NAS significantly outperforms both models trained individually from scratch and models trained with knowledge distillation (25.4% and 7.3% relative improvement, respectively). We also found that the benefit of weight-sharing supernet training for ultra-small models comes not only from searching but also from optimization.
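A hedged sketch of the weight-sharing idea behind such a supernet: every subnet reuses a slice of one shared weight tensor rather than being trained from scratch (shapes and the slicing rule are assumptions):

```python
# A single "elastic" convolution whose filters are shared by all subnets; a
# subnet simply activates a prefix of the output channels.
import torch
import torch.nn.functional as F

class ElasticConv(torch.nn.Module):
    def __init__(self, max_in=64, max_out=128, k=3):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(max_out, max_in, k, k) * 0.01)

    def forward(self, x, active_out):
        # Use only the first `active_out` filters; all subnets share this tensor.
        w = self.weight[:active_out, : x.shape[1]]
        return F.conv2d(x, w, padding=1)

layer = ElasticConv()
x = torch.randn(1, 64, 32, 32)
small = layer(x, active_out=32)   # a small subnet's view of the shared weights
large = layer(x, active_out=128)  # the full supernet layer
print(small.shape, large.shape)
```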
https://arxiv.org/abs/2303.10351
Lung cancer has emerged as a severe disease that threatens human life and health. The precise segmentation of lung regions is a crucial prerequisite for localizing tumors and can provide accurate information for lung image analysis. In this work, we propose a lung image segmentation model that uses NASNet-Large as the encoder followed by a decoder, an encoder-decoder design that is among the most commonly used architectures in deep learning for image segmentation. The proposed NASNet-Large-decoder architecture can extract high-level information and expand the feature map to recover the segmentation map. To further improve the segmentation results, we propose a post-processing layer to remove the irrelevant portions of the segmentation map. Experimental results show that the resulting segmentation model achieves a Dice score of 0.92, outperforming state-of-the-art methods.
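One plausible form of such a post-processing step is to keep only the largest connected components of the predicted mask; the rule below is an assumption for illustration, not necessarily the paper's layer:

```python
# Drop small spurious blobs from a binary segmentation mask, keeping the two
# largest connected components (e.g., the two lung fields).
import numpy as np
from scipy import ndimage

def keep_largest_components(mask, num_keep=2):
    labeled, num = ndimage.label(mask)
    if num == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, index=range(1, num + 1))
    keep = 1 + np.argsort(sizes)[::-1][:num_keep]  # labels of the largest blobs
    return np.isin(labeled, keep).astype(mask.dtype)

mask = np.zeros((64, 64), dtype=np.uint8)
mask[10:40, 5:25] = 1    # left lung-like blob
mask[10:40, 35:55] = 1   # right lung-like blob
mask[60:62, 60:62] = 1   # small spurious region
cleaned = keep_largest_components(mask)
print(mask.sum(), cleaned.sum())
```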
https://arxiv.org/abs/2303.10315
Neural Architecture Search (NAS) has shown promising performance in the automatic design of vision transformers (ViT) exceeding 1G FLOPs. However, designing lightweight and low-latency ViT models for diverse mobile devices remains a big challenge. In this work, we propose ElasticViT, a two-stage NAS approach that trains a high-quality ViT supernet over a very large search space that supports a wide range of mobile devices, and then searches an optimal sub-network (subnet) for direct deployment. However, prior supernet training methods that rely on uniform sampling suffer from the gradient conflict issue: the sampled subnets can have vastly different model sizes (e.g., 50M vs. 2G FLOPs), leading to different optimization directions and inferior performance. To address this challenge, we propose two novel sampling techniques: complexity-aware sampling and performance-aware sampling. Complexity-aware sampling limits the FLOPs difference among the subnets sampled across adjacent training steps, while covering different-sized subnets in the search space. Performance-aware sampling further selects subnets that have good accuracy, which can reduce gradient conflicts and improve supernet quality. Our discovered models, ElasticViT models, achieve top-1 accuracy from 67.2% to 80.0% on ImageNet from 60M to 800M FLOPs without extra retraining, outperforming all prior CNNs and ViTs in terms of accuracy and latency. Our tiny and small models are also the first ViT models that surpass state-of-the-art CNNs with significantly lower latency on mobile devices. For instance, ElasticViT-S1 runs 2.62x faster than EfficientNet-B0 with 0.1% higher accuracy.
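A toy sketch of complexity-aware sampling: the subnet sampled at each training step stays within one "complexity level" of the previous step, while the slow drift still covers the whole FLOPs range (the levels and window size are assumptions, not ElasticViT's exact procedure):

```python
# Limit the FLOPs difference between subnets sampled at adjacent steps.
import random

FLOPS_LEVELS = [60, 100, 200, 300, 400, 600, 800]  # in MFLOPs

def complexity_aware_sample(prev_level_idx, max_step=1):
    lo = max(0, prev_level_idx - max_step)
    hi = min(len(FLOPS_LEVELS) - 1, prev_level_idx + max_step)
    return random.randint(lo, hi)

idx, trajectory = 0, []
for step in range(10):
    idx = complexity_aware_sample(idx)
    trajectory.append(FLOPS_LEVELS[idx])
print(trajectory)  # adjacent samples differ by at most one complexity level
```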
https://arxiv.org/abs/2303.09730
Large pre-trained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. KD, however, remains ineffective, as the student is manually selected from a set of existing options already pre-trained on large corpora, a sub-optimal choice within the space of all possible student architectures. This paper proposes KD-NAS, the use of Neural Architecture Search (NAS) guided by the Knowledge Distillation process to find the optimal student model for distillation from a teacher, for a given natural language task. In each episode of the search process, a NAS controller predicts a reward based on a combination of accuracy on the downstream task and latency of inference. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture(s) with the highest reward are selected and distilled on the full downstream-task training set. When distilling on the MNLI task, our KD-NAS model produces a 2-point improvement in accuracy on GLUE tasks with GPU latency equivalent to a hand-crafted student architecture available in the literature. Using Knowledge Distillation, this model also achieves a 1.4x speedup in GPU latency (3.2x speedup on CPU) with respect to a BERT-Base teacher, while maintaining 97% performance on GLUE tasks (without CoLA). We also obtain an architecture with performance equivalent to the hand-crafted student model on the GLUE benchmark, but with a 15% speedup in GPU latency (20% speedup in CPU latency) and 0.8 times the number of parameters.
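A minimal sketch of a reward that trades accuracy against inference latency, in the spirit of the controller's reward described above (the functional form and coefficients are assumptions):

```python
# Rank candidate student architectures by accuracy penalized for exceeding a
# latency target.
def kd_nas_reward(accuracy, latency_ms, target_latency_ms=10.0, alpha=0.07):
    latency_penalty = max(0.0, latency_ms / target_latency_ms - 1.0)
    return accuracy - alpha * latency_penalty

candidates = [
    {"arch": "6L-384H", "accuracy": 0.82, "latency_ms": 9.0},
    {"arch": "8L-512H", "accuracy": 0.84, "latency_ms": 16.0},
]
best = max(candidates, key=lambda c: kd_nas_reward(c["accuracy"], c["latency_ms"]))
print(best["arch"])
```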
https://arxiv.org/abs/2303.09639
SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax-attention (SA). In addition, limited by its large convolution kernel size, SqueezeFormer's local modeling ability is insufficient. In this paper, we propose a novel method, HybridFormer, to improve SqueezeFormer in a fast and efficient way. Specifically, we first incorporate linear attention (LA) and propose a hybrid LASA paradigm to increase the model's inference speed. Second, a hybrid neural architecture search (NAS) guided structural re-parameterization (SRep) mechanism, termed NSR, is proposed to enhance the model's ability to extract local interactions. Extensive experiments conducted on the LibriSpeech dataset demonstrate that our proposed HybridFormer achieves a 9.1% relative word error rate (WER) reduction over SqueezeFormer on the test-other set. Furthermore, when the input speech is 30 s long, HybridFormer improves the model's inference speed by up to 18%. Our source code is available online.
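For intuition, a compact contrast between softmax attention (quadratic in sequence length) and a kernelized linear attention of the kind a hybrid LASA layer could mix in; the elu+1 feature map is a common choice and an assumption, not necessarily the paper's formulation:

```python
# Softmax attention materializes a (T, T) score matrix; linear attention only
# forms (d, d) summaries, so its cost grows linearly with sequence length T.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (T, T) cost
    return F.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                            # (d, d) summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (q @ kv) / z

T, d = 1000, 64
q, k, v = (torch.randn(T, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```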
https://arxiv.org/abs/2303.08636
In this paper, we investigate the relationship between diversity metrics, accuracy, and resiliency to natural image corruptions for Deep Learning (DL) image classifier ensembles. We investigate the potential of an attribution-based diversity metric to improve upon the known accuracy-diversity trade-off of the typical prediction-based diversity. Our motivation is based on analytical studies of design diversity which have shown that a reduction of common failure modes is possible if diversity of design choices is achieved. Using ResNet50 as a comparison baseline, we evaluate the resiliency of multiple individual DL model architectures against dataset distribution shifts corresponding to natural image corruptions. We compare ensembles created with diverse model architectures trained either independently or through a Neural Architecture Search technique, and evaluate the correlation of prediction-based and attribution-based diversity with the final ensemble accuracy. We evaluate a set of diversity enforcement heuristics based on negative correlation learning to assess the final ensemble resilience to natural image corruptions, and inspect the resulting prediction, activation, and attribution diversity. Our key observations are: 1) model architecture is more important for resiliency than model size or model accuracy, 2) attribution-based diversity is less negatively correlated with ensemble accuracy than prediction-based diversity, 3) a balanced loss function of individual and ensemble accuracy creates ensembles that are more resilient to natural image corruptions, and 4) architecture diversity produces more diversity in all explored diversity metrics: predictions, attributions, and activations.
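A small sketch of the two diversity notions being compared: prediction-based diversity as pairwise disagreement and attribution-based diversity as dissimilarity of attribution maps (the cosine-based measure is an illustrative choice, not the paper's exact metric):

```python
# Pairwise diversity between two ensemble members, computed on predicted labels
# and on per-input attribution maps.
import numpy as np

def prediction_diversity(preds_a, preds_b):
    return float(np.mean(preds_a != preds_b))  # fraction of inputs with disagreement

def attribution_diversity(attr_a, attr_b):
    a = attr_a.reshape(len(attr_a), -1)
    b = attr_b.reshape(len(attr_b), -1)
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12)
    return float(np.mean(1.0 - cos))  # average attribution dissimilarity

rng = np.random.default_rng(0)
preds_a, preds_b = rng.integers(0, 10, 100), rng.integers(0, 10, 100)
attr_a, attr_b = rng.normal(size=(100, 32, 32)), rng.normal(size=(100, 32, 32))
print(prediction_diversity(preds_a, preds_b), attribution_diversity(attr_a, attr_b))
```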
https://arxiv.org/abs/2303.09283
The combination of Neural Architecture Search (NAS) and quantization has proven successful in automatically designing low-FLOPs INT8 quantized neural networks (QNN). However, directly applying NAS to design accurate QNN models that achieve low latency on real-world devices leads to inferior performance. In this work, we find that the poor INT8 latency is due to a quantization-unfriendly issue: the operator and configuration (e.g., channel width) choices in prior-art search spaces lead to diverse quantization efficiency and can slow down the INT8 inference speed. To address this challenge, we propose SpaceEvo, an automatic method for designing a dedicated, quantization-friendly search space for each target hardware. The key idea of SpaceEvo is to automatically search for hardware-preferred operators and configurations to construct the search space, guided by a metric called the Q-T score that quantifies how quantization-friendly a candidate search space is. We further train a quantized-for-all supernet over our discovered search space, enabling the searched models to be directly deployed without extra retraining or quantization. Our discovered models establish new SOTA INT8 quantized accuracy under various latency constraints, achieving up to a 10.1% accuracy improvement on ImageNet over prior-art CNNs under the same latency. Extensive experiments on diverse edge devices demonstrate that SpaceEvo consistently outperforms existing manually designed search spaces with up to 2.5x faster speed while achieving the same accuracy.
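A toy evolutionary loop over candidate search spaces scored by a stand-in "Q-T score"; the real score is derived from hardware measurements of quantization friendliness, which is mocked here:

```python
# Evolve sets of operator/width choices (i.e., search spaces, not individual
# architectures) toward higher quantization friendliness.
import random

OPERATORS = ["mbconv_k3", "mbconv_k5", "fused_mbconv", "attention"]
WIDTHS = [32, 48, 64, 96]

def qt_score(space):
    # Placeholder: assume (for illustration only) that fused blocks and
    # narrower widths quantize more efficiently on the target hardware.
    op_bonus = sum(1.0 if "fused" in op else 0.5 for op in space["ops"])
    width_bonus = sum(1.0 / w for w in space["widths"])
    return op_bonus + 10.0 * width_bonus

def mutate(space):
    child = {"ops": list(space["ops"]), "widths": list(space["widths"])}
    child["ops"][random.randrange(len(child["ops"]))] = random.choice(OPERATORS)
    child["widths"][random.randrange(len(child["widths"]))] = random.choice(WIDTHS)
    return child

population = [{"ops": random.sample(OPERATORS, 2), "widths": random.sample(WIDTHS, 2)}
              for _ in range(8)]
for _ in range(20):
    parent = max(random.sample(population, 3), key=qt_score)  # tournament selection
    population.append(mutate(parent))
    population.remove(min(population, key=qt_score))
print(max(population, key=qt_score))
```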
https://arxiv.org/abs/2303.08308
Lifelong learning without catastrophic forgetting (i.e., resiliency), as possessed by human intelligence, is entangled with sophisticated memory mechanisms in the brain, especially the long-term memory (LM) maintained by the Hippocampi. To a certain extent, Transformers have emerged as the counterpart "Brain" of Artificial Intelligence (AI), yet they leave the LM component under-explored for lifelong learning settings. This paper presents a method of learning to grow Artificial Hippocampi (ArtiHippo) in Vision Transformers (ViTs) for resilient lifelong learning. Based on a comprehensive ablation study, the final linear projection layer in the multi-head self-attention (MHSA) block is selected for realizing and growing ArtiHippo. ArtiHippo is represented by a mixture of experts (MoEs). Each expert component is an on-site variant of the linear projection layer, maintained via neural architecture search (NAS) with a search space defined by four basic growing operations -- skip, reuse, adapt, and new -- for lifelong learning. The LM of a task consists of two parts: the dedicated expert components (as model parameters) at different layers of a ViT learned via NAS, and the mean class-tokens (as stored latent vectors for measuring task similarity) associated with the expert components. For a new task, a hierarchical, task-similarity-oriented, exploration-exploitation sampling-based NAS is proposed to learn the expert components. Task similarity is measured by the normalized cosine similarity between the mean class-token of the new task and those of old tasks. The proposed method is complementary to prompt-based lifelong learning with ViTs. In experiments, the proposed method is tested on the challenging Visual Domain Decathlon (VDD) benchmark and the recently proposed 5-Dataset benchmark. It obtains consistently better performance than the prior art, with a sensible ArtiHippo learned continually.
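A hedged sketch of the task-similarity signal: compare the new task's mean class token against the stored ones and use the similarity to bias the choice among the growing operations (the thresholds and the simplified three-way rule are assumptions):

```python
# Suggest a growing operation for an expert based on cosine similarity between
# the new task's mean class token and stored mean class tokens of old tasks.
import torch
import torch.nn.functional as F

def suggest_growing_op(new_task_token, old_task_tokens, hi=0.9, lo=0.5):
    sims = torch.stack([F.cosine_similarity(new_task_token, t, dim=0) for t in old_task_tokens])
    best_sim, best_idx = sims.max(dim=0)
    if best_sim > hi:
        return "reuse", int(best_idx)   # nearly the same task: reuse that expert
    if best_sim > lo:
        return "adapt", int(best_idx)   # related task: adapt the closest expert
    return "new", None                  # unrelated task: grow a new expert

old = [torch.randn(768) for _ in range(3)]
new = old[1] + 0.05 * torch.randn(768)  # a task close to old task 1
print(suggest_growing_op(new, old))
```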
https://arxiv.org/abs/2303.08250
Neural Architecture Search (NAS) has become increasingly appealing to the object Re-Identification (ReID) community, as task-specific architectures significantly improve retrieval performance. Previous works explore new optimization targets and search spaces for NAS ReID, yet they neglect the difference in training schemes between image classification and ReID. In this work, we propose a novel Twins Contrastive Mechanism (TCM) to provide more appropriate supervision for ReID architecture search. TCM reduces the category overlap between the training and validation data, and assists NAS in simulating real-world ReID training schemes. We then design a Multi-Scale Interaction (MSI) search space to search for rational interaction operations between multi-scale features. In addition, we introduce a Spatial Alignment Module (SAM) to further enhance attention consistency when confronted with images from different sources. Under the proposed NAS scheme, a specific architecture is automatically searched for, named MSINet. Extensive experiments demonstrate that our method surpasses state-of-the-art ReID methods in both in-domain and cross-domain scenarios. Source code is available at this https URL.
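The core of reducing category overlap can be sketched as splitting identities, not images, between the data used to train supernet weights and the data used to rank architectures, mimicking ReID's open-set evaluation (the 50/50 identity split is an illustrative assumption):

```python
# Split a ReID dataset so that training and validation use disjoint identities.
import random

def split_identities(samples, val_fraction=0.5, seed=0):
    """samples: list of (image_path, identity_id); returns train/val with disjoint identities."""
    ids = sorted({pid for _, pid in samples})
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * (1 - val_fraction))
    train_ids = set(ids[:cut])
    train = [s for s in samples if s[1] in train_ids]
    val = [s for s in samples if s[1] not in train_ids]
    return train, val

data = [(f"img_{i}.jpg", i % 10) for i in range(100)]
train, val = split_identities(data)
print(len(train), len(val), {p for _, p in train} & {p for _, p in val})  # empty overlap
```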
https://arxiv.org/abs/2303.07065
Touch-based fingerprint biometrics is one of the most popular biometric modalities, with applications in several fields. Problems associated with touch-based techniques, such as the presence of latent fingerprints and hygiene issues due to many people touching the same surface, motivated the community to look for non-contact-based solutions. Over the last few years, contactless fingerprint systems have been on the rise and in demand because of the ability to turn any device with a camera into a fingerprint reader. Yet, before we can fully utilize the benefits of non-contact-based methods, the biometric community needs to resolve a few concerns, such as the resiliency of the system against presentation attacks. One of the major obstacles is the limited publicly available datasets with inadequate spoof and live data. In this publication, we have developed a Presentation Attack Detection (PAD) dataset of more than 7,500 four-finger images, more than 14,000 manually segmented single-fingertip images, and 10,000 synthetic fingertips (deepfakes). The PAD dataset was collected from six different Presentation Attack Instruments (PAI) of three different difficulty levels according to FIDO protocols, with five different types of PAI materials and different smartphone cameras with manual focusing. We have utilized DenseNet-121 and NASNetMobile models with our proposed dataset to develop PAD algorithms and achieved an Attack Presentation Classification Error Rate (APCER) of 0.14% and a Bona Fide Presentation Classification Error Rate (BPCER) of 0.18%. We have also reported the test results of the models against unseen spoof types to replicate uncertain real-world testing scenarios.
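For reference, the two reported PAD metrics can be computed as follows (the threshold and toy scores are illustrative):

```python
# APCER: attack presentations wrongly accepted as bona fide.
# BPCER: bona fide presentations wrongly rejected as attacks.
import numpy as np

def apcer_bpcer(scores, is_attack, threshold=0.5):
    """scores: higher = more likely attack; is_attack: boolean ground truth."""
    scores, is_attack = np.asarray(scores), np.asarray(is_attack, dtype=bool)
    decided_attack = scores >= threshold
    apcer = np.mean(~decided_attack[is_attack])   # missed attacks
    bpcer = np.mean(decided_attack[~is_attack])   # rejected bona fide samples
    return float(apcer), float(bpcer)

scores = [0.9, 0.8, 0.3, 0.1, 0.2, 0.7]
labels = [True, True, True, False, False, False]
print(apcer_bpcer(scores, labels))
```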
https://arxiv.org/abs/2303.05459
Vision Transformers have enabled recent attention-based Deep Learning (DL) architectures to achieve remarkable results in Computer Vision (CV) tasks. However, due to the extensive computational resources required, these architectures are rarely implemented on resource-constrained platforms. Current research investigates hybrid handcrafted convolution-based and attention-based models for CV tasks such as image classification and object detection. In this paper, we propose HyT-NAS, an efficient Hardware-aware Neural Architecture Search (HW-NAS) that includes hybrid architectures targeting vision tasks on tiny devices. HyT-NAS improves state-of-the-art HW-NAS by enriching the search space and enhancing the search strategy as well as the performance predictors. Our experiments show that HyT-NAS achieves a similar hypervolume with less than ~5x training evaluations. Our resulting architecture outperforms MLPerf MobileNetV1 on Visual Wake Words with a 6.3% accuracy improvement and 3.5x fewer parameters.
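The hypervolume used to compare such multi-objective searches can be computed in 2-D as the area dominated with respect to a reference point; a small sketch with accuracy to maximize and latency to minimize (the reference point and candidate values are illustrative):

```python
# 2-D hypervolume: area of the union of rectangles dominated by the candidates
# relative to a (worst acceptable accuracy, worst acceptable latency) reference.
def hypervolume_2d(points, ref):
    """points: (accuracy, latency) pairs; ref: (accuracy floor, latency ceiling)."""
    acc_floor, lat_ceiling = ref
    hv, covered_acc = 0.0, acc_floor
    # Process candidates from best (lowest) latency to worst; each one only adds
    # the accuracy range not yet covered, at its own latency height.
    for acc, lat in sorted(points, key=lambda p: p[1]):
        if acc > covered_acc and lat < lat_ceiling:
            hv += (acc - covered_acc) * (lat_ceiling - lat)
            covered_acc = acc
    return hv

front_a = [(0.70, 5.0), (0.80, 10.0)]
front_b = [(0.72, 6.0), (0.78, 12.0)]
print(hypervolume_2d(front_a, ref=(0.60, 20.0)), hypervolume_2d(front_b, ref=(0.60, 20.0)))
```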
https://arxiv.org/abs/2303.04440
This project aimed to determine the grain size distribution of granular materials from images using convolutional neural networks. The application of ConvNet and pretrained ConvNet models, including AlexNet, SqueezeNet, GoogLeNet, InceptionV3, DenseNet201, MobileNetV2, ResNet18, ResNet50, ResNet101, Xception, InceptionResNetV2, ShuffleNet, and NASNetMobile was studied. Synthetic images of granular materials created with the discrete element code YADE were used. All the models were trained and verified with grayscale and color band datasets with image sizes ranging from 32 to 160 pixels. The proposed ConvNet model predicts the percentages of mass retained on the finest sieve, coarsest sieve, and all sieves with root-mean-square errors of 1.8 %, 3.3 %, and 2.8 %, respectively, and a coefficient of determination of 0.99. For pretrained networks, root-mean-square errors of 2.4 % and 2.8 % were obtained for the finest sieve with feature extraction and transfer learning models, respectively.
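The two reported regression metrics, applied to predicted percentages of mass retained per sieve, reduce to the following (toy values are illustrative):

```python
# Root-mean-square error and coefficient of determination (R^2) for a sieve
# retention regression.
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def r2(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

retained_true = [12.0, 25.0, 40.0, 18.0, 5.0]   # % mass retained per sieve
retained_pred = [13.5, 23.0, 41.0, 19.5, 4.0]
print(rmse(retained_true, retained_pred), r2(retained_true, retained_pred))
```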
https://arxiv.org/abs/2303.04269
3D convolutional neural networks (CNNs) have been the prevailing option for video recognition. To capture the temporal information, 3D convolutions are computed along the sequences, leading to cubically growing and expensive computations. To reduce the computational cost, previous methods resort to manually designed 3D/2D CNN structures with approximations or automatic search, which sacrifice modeling ability or make training time-consuming. In this work, we propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach tailored for 3D CNNs that takes model complexity into account. To measure the expressiveness of 3D CNNs efficiently, we formulate a 3D CNN as an information system and derive an analytic entropy score based on the Maximum Entropy Principle. Specifically, we propose a spatio-temporal entropy score (STEntr-Score) with a refinement factor to handle the discrepancy of visual information in the spatial and temporal dimensions, by dynamically leveraging the depth-wise correlation between feature map size and kernel size. Highly efficient and expressive 3D CNN architectures, i.e., entropy-based 3D CNNs (the E3D family), can then be searched efficiently by maximizing the STEntr-Score under a given computational budget via an evolutionary algorithm, without training the network parameters. Extensive experiments on Something-Something V1&V2 and Kinetics400 demonstrate that the E3D family achieves state-of-the-art performance with higher computational efficiency. Code is available at this https URL.
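A hedged sketch of a training-free, entropy-style proxy in the same spirit: score randomly initialized 3D CNNs by the Gaussian differential entropy of their feature maps and keep the best under a budget. This is a generic zero-cost proxy for illustration, not the paper's STEntr-Score:

```python
# Rank untrained 3D CNNs by an entropy estimate of their activations under a
# parameter budget (stand-in for a FLOPs budget); no weights are trained.
import math
import torch
import torch.nn as nn

def entropy_score(model, input_shape=(1, 3, 8, 32, 32)):
    with torch.no_grad():
        feats = model(torch.randn(*input_shape))
        var = feats.float().var().item() + 1e-8
    return 0.5 * math.log(2 * math.pi * math.e * var)  # Gaussian differential entropy

def make_3dcnn(width, depth):
    layers, in_ch = [], 3
    for _ in range(depth):
        layers += [nn.Conv3d(in_ch, width, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = width
    return nn.Sequential(*layers)

budget = 2_000_000
candidates = [(w, d) for w in (16, 32, 64) for d in (2, 3, 4)]
best = None
for w, d in candidates:
    model = make_3dcnn(w, d)
    if sum(p.numel() for p in model.parameters()) > budget:
        continue
    score = entropy_score(model)
    if best is None or score > best[0]:
        best = (score, w, d)
print(best)
```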
https://arxiv.org/abs/2303.02693
Miniaturized autonomous unmanned aerial vehicles (UAVs) are an emerging and trending topic. With a form factor as big as the palm of one hand, they can reach spots otherwise inaccessible to bigger robots and safely operate in human surroundings. The simple electronics aboard such robots (sub-100 mW) make them particularly cheap and attractive, but pose significant challenges in enabling sophisticated onboard intelligence. In this work, we leverage a novel neural architecture search (NAS) technique to automatically identify several Pareto-optimal convolutional neural networks (CNNs) for a visual pose estimation task. Our work demonstrates how real-life and field-tested robotics applications can concretely leverage NAS technologies to automatically and efficiently optimize CNNs for the specific hardware constraints of small UAVs. We deploy several NAS-optimized CNNs and run them in closed loop aboard a 27-g Crazyflie nano-UAV equipped with a parallel ultra-low-power System-on-Chip. Our results improve the state of the art by reducing the in-field control error by 32% while achieving a real-time onboard inference rate of ~10Hz@10mW and ~50Hz@90mW.
https://arxiv.org/abs/2303.01931
The remarkable performance of deep convolutional neural networks (CNNs) is generally attributed to their deeper and wider architectures, which can come with significant computational costs. Pruning neural networks has thus gained interest since it effectively lowers storage and computational costs. In contrast to weight pruning, which results in unstructured models, structured pruning provides the benefit of realistic acceleration by producing models that are friendly to hardware implementation. The special requirements of structured pruning have led to the discovery of numerous new challenges and the development of innovative solutions. This article surveys the recent progress towards structured pruning of deep CNNs. We summarize and compare the state-of-the-art structured pruning techniques with respect to filter ranking methods, regularization methods, dynamic execution, neural architecture search, the lottery ticket hypothesis, and the applications of pruning. While discussing structured pruning algorithms, we briefly introduce the unstructured pruning counterpart to emphasize their differences. Furthermore, we provide insights into potential research opportunities in the field of structured pruning. A curated list of neural network pruning papers can be found at this https URL.
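As a concrete instance of the filter-ranking family covered by the survey, a minimal L1-norm filter-pruning sketch that also trims the next layer's input channels (ratios and layer sizes are illustrative):

```python
# Rank a conv layer's filters by L1 norm, keep the top fraction, and rebuild
# both that layer and the following layer so the model stays dense.
import torch
import torch.nn as nn

def prune_conv_pair(conv, next_conv, keep_ratio=0.5):
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = torch.argsort(norms, descending=True)[:n_keep]
    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size, padding=conv.padding)
    new_conv.weight.data = conv.weight.data[keep].clone()
    new_conv.bias.data = conv.bias.data[keep].clone()
    new_next = nn.Conv2d(n_keep, next_conv.out_channels, next_conv.kernel_size,
                         padding=next_conv.padding)
    new_next.weight.data = next_conv.weight.data[:, keep].clone()
    new_next.bias.data = next_conv.bias.data.clone()
    return new_conv, new_next

c1, c2 = nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 128, 3, padding=1)
p1, p2 = prune_conv_pair(c1, c2)
x = torch.randn(1, 3, 32, 32)
print(p2(p1(x)).shape)  # the pruned pair is still a dense, hardware-friendly model
```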
https://arxiv.org/abs/2303.00566
Given the recent impressive accomplishments of language models (LMs) for code generation, we explore the use of LMs as adaptive mutation and crossover operators for an evolutionary neural architecture search (NAS) algorithm. While NAS still proves too difficult a task for LMs to succeed at solely through prompting, we find that the combination of evolutionary prompt engineering with soft prompt-tuning, a method we term EvoPrompting, consistently finds diverse and high performing models. We first demonstrate that EvoPrompting is effective on the computationally efficient MNIST-1D dataset, where EvoPrompting produces convolutional architecture variants that outperform both those designed by human experts and naive few-shot prompting in terms of accuracy and model size. We then apply our method to searching for graph neural networks on the CLRS Algorithmic Reasoning Benchmark, where EvoPrompting is able to design novel architectures that outperform current state-of-the-art models on 21 out of 30 algorithmic reasoning tasks while maintaining similar model size. EvoPrompting is successful at designing accurate and efficient neural network architectures across a variety of machine learning tasks, while also being general enough for easy adaptation to other tasks beyond neural network design.
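A skeleton of the evolutionary loop with an LM as the variation operator: parents and their scores form the prompt, and the model proposes children that are then evaluated and added to the population. The `lm_propose_child` stub below stands in for a real LM call and is purely an assumption to keep the sketch runnable:

```python
# Evolutionary search where "mutation/crossover" is delegated to a (stubbed)
# language model conditioned on the best parent programs and their scores.
import random

def lm_propose_child(prompt):
    # Placeholder for an LM completion; here we just tweak a width in one of
    # the parent programs to keep the example self-contained and runnable.
    parent_line = prompt.splitlines()[-1]
    width = int(parent_line.split("width=")[1].rstrip(")"))
    return f"make_mlp(width={max(8, width + random.choice([-16, 16]))})"

def evaluate(code):
    width = int(code.split("width=")[1].rstrip(")"))
    return 1.0 - abs(width - 96) / 96  # toy fitness peaking at width 96

population = [(evaluate(c), c) for c in ["make_mlp(width=32)", "make_mlp(width=64)"]]
for _ in range(20):
    parents = sorted(population, reverse=True)[:2]
    prompt = "\n".join(f"# score={s:.3f}\n{c}" for s, c in parents)
    child = lm_propose_child(prompt)
    population.append((evaluate(child), child))
print(max(population))
```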
https://arxiv.org/abs/2302.14838
Existing one-shot neural architecture search (NAS) methods have to conduct a search over a giant supernet, which leads to a huge computational cost. To reduce this cost, in this paper we propose a method, called FTSO, that divides the whole architecture search into two sub-steps. Specifically, in the first step we only search for the topology, and in the second step we search for the operators. FTSO not only reduces NAS's search time from days to 0.68 seconds, but also significantly improves the found architecture's accuracy. Our extensive experiments on ImageNet show that within 18 seconds, FTSO can achieve a 76.4% testing accuracy, 1.5% higher than the SOTA, PC-DARTS. In addition, when searching on CIFAR-10, FTSO can reach a 97.77% testing accuracy, 0.27% higher than the SOTA, while saving nearly all (99.8%) of the search time.
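A skeleton of the two-step decomposition: fix the cell topology first, then assign operators to the chosen edges; the candidate sets and the stub evaluator are illustrative assumptions:

```python
# Step 1 searches only which edges are active; step 2 searches the operator on
# each active edge with the topology frozen, shrinking the joint search space.
import itertools
import random

NODES = 4
EDGES = [(i, j) for i in range(NODES) for j in range(i + 1, NODES)]
OPERATORS = ["sep_conv_3x3", "dil_conv_3x3", "max_pool_3x3", "skip_connect"]

def score(topology, ops=None):
    # Stand-in for validation accuracy with shared supernet weights.
    random.seed(hash((tuple(sorted(topology)), tuple(ops or ()))) % (2**32))
    return random.random()

# Step 1: search only the topology (here: which 2-edge subsets to keep).
topologies = list(itertools.combinations(EDGES, 2))
best_topology = max(topologies, key=lambda t: score(t))

# Step 2: with the topology fixed, search the operator assigned to each edge.
assignments = itertools.product(OPERATORS, repeat=len(best_topology))
best_ops = max(assignments, key=lambda ops: score(best_topology, ops))
print(best_topology, best_ops)
```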
https://arxiv.org/abs/2303.12948