The expanding size of language models has created the need for a comprehensive examination of the trade-offs among hardware metrics such as latency, energy consumption, GPU memory usage, and performance. There is growing interest in establishing Pareto frontiers for different language model configurations to identify models that are optimal under specified hardware constraints. Notably, architectures that excel in latency on one device may not perform optimally on another. However, exhaustively training and evaluating numerous architectures across diverse hardware configurations is computationally prohibitive. To this end, we propose HW-GPT-Bench, a hardware-aware language model surrogate benchmark, in which we leverage weight-sharing techniques from Neural Architecture Search (NAS) to efficiently train a supernet proxy that encompasses language models of varying scales in a single model. We profile these models across 13 devices, considering 5 hardware metrics and 3 distinct model scales. Finally, we showcase the usability of HW-GPT-Bench with 8 different multi-objective NAS algorithms and evaluate the quality of the resulting Pareto fronts. Through this benchmark, our objective is to propel and expedite research on multi-objective methods for NAS and structural pruning in large language models.
https://arxiv.org/abs/2405.10299
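Identifying models that are optimal under hardware constraints, as HW-GPT-Bench does, reduces to computing a Pareto front over the measured metrics. Below is a minimal sketch of that computation; the metric values are hypothetical, not HW-GPT-Bench data:

```python
def pareto_front(points):
    """Return the subset of points not dominated by any other point.

    Each point is a tuple of metrics to MINIMIZE, e.g. (latency_ms, perplexity).
    A point o dominates p if o is <= p in every metric and differs from p.
    """
    front = []
    for p in points:
        dominated = any(
            o != p and all(o[i] <= p[i] for i in range(len(p)))
            for o in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (latency_ms, perplexity) measurements for candidate architectures.
candidates = [(12.0, 25.1), (8.5, 30.4), (15.0, 24.0), (9.0, 29.0), (14.0, 26.0)]
front = pareto_front(candidates)
print(sorted(front))  # (14.0, 26.0) is dominated by (12.0, 25.1) and drops out
```

Multi-objective NAS algorithms differ in how they steer the search toward this front, but the dominance relation above is the shared evaluation criterion.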
Over the past few years, as large language models have ushered in an era of emergent intelligence, there has been an intensified focus on scaling networks. Currently, many network architectures are designed manually, often resulting in sub-optimal configurations. Although Neural Architecture Search (NAS) methods have been proposed to automate this process, they suffer from low search efficiency. This study introduces Differentiable Model Scaling (DMS), which increases the efficiency of searching for the optimal width and depth of networks. DMS models both width and depth in a direct and fully differentiable way, making them easy to optimize. We evaluate DMS across diverse tasks, ranging from vision to NLP, and on various network architectures, including CNNs and Transformers. Results consistently indicate that DMS can find improved structures and outperforms state-of-the-art NAS methods. Specifically, for image classification on ImageNet, DMS improves the top-1 accuracy of EfficientNet-B0 and DeiT-Tiny by 1.4% and 0.6%, respectively, and outperforms the state-of-the-art zero-shot NAS method ZiCo by 1.3% while requiring only 0.4 GPU days for searching. For object detection on COCO, DMS improves the mAP of YOLOv8-n by 2.0%. For language modeling, our pruned Llama-7B outperforms the prior method with lower perplexity and higher zero-shot classification accuracy. We will release our code in the future.
https://arxiv.org/abs/2405.07194
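A rough sketch of how structural width can be modeled differentiably, in the spirit of DMS; the sigmoid-over-threshold form, the quantile threshold, and the hyper-parameters are our illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def soft_width_mask(importance, pruning_ratio, temperature=10.0):
    """Differentiable structural mask over channels (illustrative sketch).

    Channels whose learnable importance exceeds a threshold (placed so that
    `pruning_ratio` of the channels fall below it) get a mask value near 1;
    the sigmoid keeps the mask smooth, so gradients can flow back into the
    importance scores instead of a hard, non-differentiable top-k selection.
    """
    threshold = np.quantile(importance, pruning_ratio)
    return 1.0 / (1.0 + np.exp(-temperature * (importance - threshold)))

imp = np.array([0.1, 0.9, 0.4, 0.8, 0.2, 0.7])   # hypothetical learned scores
mask = soft_width_mask(imp, pruning_ratio=0.5)
# Low-importance channels are softly suppressed; the effective width of the
# layer is approximately the sum of the mask.
print(mask.round(2), round(float(mask.sum()), 2))
```

Multiplying a layer's channel outputs by such a mask lets a single gradient-based optimizer trade width against the task loss, which is the key to avoiding the expensive discrete search of conventional NAS.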
Neural Architecture Search (NAS) methods have been shown to output networks that largely outperform human-designed networks. However, conventional NAS methods have mostly tackled the single-dataset scenario, incurring a large computational cost since the procedure has to be run from scratch for every new dataset. In this work, we focus on predictor-based algorithms and propose a simple and efficient way of improving their prediction performance when dealing with data distribution shifts. We exploit the Kronecker product on the randomly wired search space and create a small NAS benchmark composed of networks trained over four different datasets. To improve generalization abilities, we propose GRASP-GCN, a ranking Graph Convolutional Network that takes the shapes of the neural networks' layers as additional input. GRASP-GCN is trained with not-at-convergence accuracies, improves the state of the art by 3.3% on CIFAR-10, and additionally increases generalization abilities under data distribution shift.
https://arxiv.org/abs/2405.06994
We aim to exploit additional auxiliary labels from an independent (auxiliary) task to boost the performance of the primary task we focus on, while preserving the single-task inference cost of the primary task. While most existing auxiliary learning methods are optimization-based, relying on loss-weight/gradient manipulation, our method is architecture-based, with a flexible asymmetric structure for the primary and auxiliary tasks that produces different networks for training and inference. Specifically, starting from two single-task networks/branches (each representing a task), we propose a novel method with evolving networks in which only primary-to-auxiliary links remain as cross-task connections after convergence. These connections can be removed during primary-task inference, resulting in a single-task inference cost. We achieve this by formulating a Neural Architecture Search (NAS) problem, where we initialize bi-directional connections in the search space and guide the NAS optimization to converge to an architecture with only the single-side primary-to-auxiliary connections. Moreover, our method can be combined with optimization-based auxiliary learning approaches. Extensive experiments with six tasks on the NYU v2, CityScapes, and Taskonomy datasets using VGG, ResNet, and ViT backbones validate the promising performance. The codes are available at this https URL.
https://arxiv.org/abs/2405.05695
To defend deep neural networks against adversarial attacks, adversarial training has been drawing increasing attention for its effectiveness. However, the accuracy and robustness resulting from adversarial training are limited by the architecture, because adversarial training improves accuracy and robustness by adjusting the weight connections affiliated with the architecture. In this work, we propose ARNAS to search for accurate and robust architectures for adversarial training. First, we design an accurate and robust search space, in which the placement of the cells and the proportional relationship of the filter numbers are carefully determined. With this design, architectures can obtain both accuracy and robustness by deploying accurate and robust structures at their respective sensitive positions. We then propose a differentiable multi-objective search strategy that performs gradient descent toward directions beneficial for both the natural loss and the adversarial loss, so that accuracy and robustness can be guaranteed at the same time. We conduct comprehensive experiments in terms of white-box attacks, black-box attacks, and transferability. Experimental results show that the searched architecture has the strongest robustness with competitive accuracy, and breaks the traditional idea that NAS-based architectures cannot transfer well to complex tasks in robustness scenarios. By analyzing outstanding searched architectures, we also conclude that accurate and robust neural architectures tend to deploy different structures near the input and the output, which has great practical significance for both the hand-crafting and the automatic design of accurate and robust architectures.
https://arxiv.org/abs/2405.05502
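Descending toward directions beneficial for both the natural and the adversarial loss, as in the ARNAS search strategy above, can be illustrated with generic gradient surgery: when the two gradients conflict, each is projected off the other before the step is taken. This projection scheme is our own assumption for illustration, not necessarily ARNAS's exact strategy:

```python
import numpy as np

def dual_objective_step(g_nat, g_adv, lr=0.1):
    """One descent step that decreases both losses (gradient-surgery sketch).

    If the natural-loss and adversarial-loss gradients conflict (negative dot
    product), each is projected onto the normal plane of the other; the summed
    step then has a non-positive inner product with both original gradients,
    i.e. it is a descent direction for both losses.
    """
    g1, g2 = g_nat.copy(), g_adv.copy()
    if g1 @ g2 < 0:
        g1 = g_nat - (g_nat @ g_adv) / (g_adv @ g_adv) * g_adv
        g2 = g_adv - (g_adv @ g_nat) / (g_nat @ g_nat) * g_nat
    return -lr * (g1 + g2)

g_nat = np.array([1.0, 0.0])
g_adv = np.array([-0.5, 1.0])   # conflicts with g_nat in the first coordinate
step = dual_objective_step(g_nat, g_adv)
# Both inner products are <= 0, so the step decreases both losses locally.
print(step @ g_nat <= 0, step @ g_adv <= 0)
```

In a differentiable NAS setting the same idea is applied to the gradients of the architecture parameters rather than the weights.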
Accurate classification of medical images is essential for modern diagnostics. Advances in deep learning have led clinicians to increasingly use sophisticated models to make faster and more accurate decisions, sometimes replacing human judgment. However, model development is costly and repetitive. Neural Architecture Search (NAS) provides solutions by automating the design of deep learning architectures. This paper presents ZO-DARTS+, a differentiable NAS algorithm that improves search efficiency through a novel method of generating sparse probabilities via bi-level optimization. Experiments on five public medical datasets show that ZO-DARTS+ matches the accuracy of state-of-the-art solutions while reducing search times by up to a factor of three.
https://arxiv.org/abs/2405.03462
For Ising models with complex energy landscapes, whether the ground state can be found by neural networks depends heavily on the Hamming distance between the training datasets and the ground state. Although various recently proposed generative models have shown good performance in solving Ising models, there has been no adequate discussion of how to quantify their generalization capabilities. Here we design a Hamming-distance regularizer in the framework of a class of generative models, variational autoregressive networks (VAN), to quantify the generalization capabilities of various network architectures combined with VAN. The regularizer controls the size of the overlaps between the ground state and the training datasets generated by the networks, which, together with the success rates of finding the ground state, forms a quantitative metric of their generalization capabilities. We conduct numerical experiments on several prototypical network architectures combined with VAN, including feed-forward neural networks, recurrent neural networks, and graph neural networks, to quantify their generalization capabilities when solving Ising models. Moreover, since quantifying the generalization capabilities of networks on small-scale problems can be used to predict their relative performance on large-scale problems, our method is of great significance for assisting Neural Architecture Search in finding optimal network architectures for solving large-scale Ising models.
https://arxiv.org/abs/2405.03435
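The core quantity in the regularizer described above is the normalized Hamming distance between model samples and the ground state. The sketch below computes it and penalizes deviation from a target overlap; the quadratic penalty form is our illustrative assumption, not the paper's exact formula:

```python
import numpy as np

def hamming_regularizer(samples, ground_state, target_overlap, strength=1.0):
    """Sketch of a Hamming-distance regularizer for spin configurations.

    `samples` is an (n, N) array of +/-1 spins drawn from the generative
    model, `ground_state` an (N,) array of +/-1 spins. The penalty pushes the
    mean normalized Hamming distance toward `1 - target_overlap`, controlling
    how much the generated training data may overlap the ground state.
    """
    # Normalized Hamming distance: fraction of spins that differ, in [0, 1].
    dist = np.mean(samples != ground_state, axis=1)
    return strength * (dist.mean() - (1.0 - target_overlap)) ** 2

gs = np.array([1, 1, -1, 1])
samples = np.array([[1, 1, -1, 1],     # distance 0.0 from the ground state
                    [1, -1, -1, 1],    # distance 0.25
                    [-1, -1, 1, -1]])  # distance 1.0
penalty = float(hamming_regularizer(samples, gs, target_overlap=0.5))
print(round(penalty, 4))
```

Sweeping `target_overlap` and recording the success rate of recovering the ground state then yields the quantitative generalization metric the abstract describes.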
This paper introduces a novel approach to enhancing the performance of pre-trained neural networks in medical image segmentation using Neural Architecture Search (NAS) methods, specifically Differentiable Architecture Search (DARTS). We present the concept of the Implantable Adaptive Cell (IAC): small but powerful modules identified through Partially-Connected DARTS, designed to be injected into the skip connections of an existing, already trained U-shaped model. Our strategy allows for the seamless integration of the IAC into the pre-existing architecture, thereby enhancing its performance without necessitating a complete retraining from scratch. The empirical studies, focusing on medical image segmentation tasks, demonstrate the efficacy of this method. Integrating specialized IAC cells into various configurations of the U-Net model increases segmentation accuracy by almost 2 percentage points on average on the validation dataset and by over 3 percentage points on the training dataset. The findings of this study not only offer a cost-effective alternative to the complete overhaul of complex models for performance upgrades but also indicate the potential applicability of our method to other architectures and problem domains.
https://arxiv.org/abs/2405.03420
Multi-modal feature fusion, as a core investigative component of RGBT tracking, has given rise to numerous fusion studies in recent years. However, existing RGBT tracking methods widely adopt fixed fusion structures to integrate multi-modal features, which are hard-pressed to handle the various challenges of dynamic scenarios. To address this problem, this work presents a novel Attention-based Fusion router called AFter, which optimizes the fusion structure to adapt to dynamic challenging scenarios for robust RGBT tracking. In particular, we design a fusion structure space based on a hierarchical attention network, where each attention-based fusion unit corresponds to a fusion operation and each combination of these attention units corresponds to a fusion structure. By optimizing the combination of attention-based fusion units, we can dynamically select the fusion structure to adapt to various challenging scenarios. Unlike the complex search over different structures in neural architecture search algorithms, we develop a dynamic routing algorithm, which equips each attention-based fusion unit with a router, to predict the combination weights for efficient optimization of the fusion structure. Extensive experiments on five mainstream RGBT tracking datasets demonstrate the superior performance of the proposed AFter against state-of-the-art RGBT trackers. We release the code at this https URL.
https://arxiv.org/abs/2405.02717
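The routing idea above, a router per fusion unit predicting combination weights, can be sketched in a few lines. The pooling, the linear router, and the toy fusion operations are our simplifications, not AFter's actual units:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route_fusion(feature, fusion_units, router_weights):
    """Minimal sketch of attention-based fusion routing.

    Each fusion unit is a callable transforming the feature; a linear router
    maps the (pooled) feature to one logit per unit, and the output is the
    softmax-weighted combination of all unit outputs. Because the weights
    depend on the input, the effective fusion structure adapts per instance
    instead of being fixed.
    """
    pooled = feature.mean()                  # crude global pooling
    logits = router_weights * pooled         # one logit per fusion unit
    weights = softmax(logits)
    outputs = [unit(feature) for unit in fusion_units]
    return sum(w * out for w, out in zip(weights, outputs)), weights

units = [lambda f: f,              # identity fusion op (placeholder)
         lambda f: f * 0.5,        # attenuating op (placeholder)
         lambda f: np.tanh(f)]     # squashing op (placeholder)
feat = np.array([0.2, 1.5, -0.3])
fused, w = route_fusion(feat, units, router_weights=np.array([1.0, 0.0, -1.0]))
print(w.round(3), fused.round(3))
```

Training the router end-to-end with the tracker is what replaces the discrete structure search of conventional NAS.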
Pre-trained language models (PLMs), for example BERT or RoBERTa, mark the state of the art for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges in deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency, for example in terms of model size or latency, against generalization performance. We also show how more recently developed two-stage weight-sharing NAS approaches can be utilized in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose adopting a multi-objective approach that identifies the Pareto-optimal set of sub-networks, allowing for a more flexible and automated compression process.
https://arxiv.org/abs/2405.02267
Recent works in dataset distillation seek to minimize training expenses by generating a condensed synthetic dataset that encapsulates the information present in a larger real dataset. These approaches ultimately aim to attain test accuracy levels akin to those achieved by models trained on the entirety of the original dataset. Previous studies in feature and distribution matching have achieved significant results without incurring the costs of bi-level optimization in the distillation process. Despite their convincing efficiency, many of these methods suffer from marginal downstream performance improvements, limited distillation of contextual information, and subpar cross-architecture generalization. To address these challenges in dataset distillation, we propose the ATtentiOn Mixer (ATOM) module to efficiently distill large datasets using a mixture of channel-wise and spatial-wise attention in the feature matching process. Spatial-wise attention helps guide the learning process based on the consistent localization of classes in their respective images, allowing for distillation from a broader receptive field. Meanwhile, channel-wise attention captures the contextual information associated with the class itself, thus making the synthetic image more informative for training. By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets, including CIFAR10/100 and TinyImagenet. Notably, our method significantly improves performance in scenarios with a low number of images per class, thereby enhancing its potential. Furthermore, the improvement is maintained across architectures and in applications such as neural architecture search.
https://arxiv.org/abs/2405.01373
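Mixing channel-wise and spatial-wise attention over a feature map, as the ATOM module does before feature matching, can be sketched as follows. The softmax-over-pooled-statistics form is our simplified rendition, not the paper's exact design:

```python
import numpy as np

def atom_style_attention(feature_map):
    """Apply channel-wise and spatial-wise attention to a (C, H, W) feature map.

    Channel attention: one weight per channel from global average pooling,
    normalized with a softmax (captures class-level context). Spatial
    attention: one weight per location from the channel-mean map (captures
    where the class consistently appears). Both are applied multiplicatively.
    """
    # Channel-wise attention weights, shape (C,).
    chan = feature_map.mean(axis=(1, 2))
    chan = np.exp(chan - chan.max())
    chan /= chan.sum()
    # Spatial-wise attention weights, shape (H, W).
    spat = feature_map.mean(axis=0)
    spat = np.exp(spat - spat.max())
    spat /= spat.sum()
    return feature_map * chan[:, None, None] * spat[None, :, :]

fm = np.arange(24, dtype=float).reshape(2, 3, 4)  # toy 2-channel feature map
out = atom_style_attention(fm)
print(out.shape)
```

In the distillation loss, the attended features of real and synthetic batches are then matched instead of the raw features.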
Exploring dense connectivity of convolutional operators establishes critical "synapses" to communicate feature vectors from different levels and enriches the set of transformations on Computer Vision applications. Yet, even with heavy-machinery approaches such as Neural Architecture Search (NAS), discovering effective connectivity patterns requires tremendous efforts due to either constrained connectivity design space or a sub-optimal exploration process induced by an unconstrained search space. In this paper, we propose CSCO, a novel paradigm that fabricates effective connectivity of convolutional operators with minimal utilization of existing design motifs and further utilizes the discovered wiring to construct high-performing ConvNets. CSCO guides the exploration via a neural predictor as a surrogate of the ground-truth performance. We introduce Graph Isomorphism as data augmentation to improve sample efficiency and propose a Metropolis-Hastings Evolutionary Search (MH-ES) to evade locally optimal architectures and advance search quality. Results on ImageNet show ~0.6% performance improvement over hand-crafted and NAS-crafted dense connectivity. Our code is publicly available.
https://arxiv.org/abs/2404.17152
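The Metropolis-Hastings acceptance rule that lets MH-ES evade locally optimal architectures can be sketched generically: a fitter mutant always replaces the parent, while a worse one is still accepted with a probability that decays with the fitness drop. The proposal, temperature, and toy fitness below are our assumptions, not CSCO's actual search space:

```python
import math
import random

def mh_evolutionary_search(init, mutate, fitness, steps=200,
                           temperature=0.1, seed=0):
    """Metropolis-Hastings-style evolutionary search (illustrative sketch).

    A mutated candidate always replaces the parent when fitter; otherwise it
    is accepted with probability exp(delta / temperature), which lets the
    search walk out of local optima. The best candidate ever seen is returned.
    """
    rng = random.Random(seed)
    current, best = init, init
    for _ in range(steps):
        candidate = mutate(current, rng)
        delta = fitness(candidate) - fitness(current)
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = candidate
        if fitness(current) > fitness(best):
            best = current
    return best

# Toy continuous "architecture" with its optimum at x = 3.0.
best = mh_evolutionary_search(
    init=0.0,
    mutate=lambda x, rng: x + rng.gauss(0, 0.5),
    fitness=lambda x: -(x - 3.0) ** 2,
)
print(round(best, 2))  # should land near the optimum at 3.0
```

In CSCO the fitness would come from the neural predictor rather than being evaluated directly, which is what keeps the search cheap.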
As one of the emerging challenges in Automated Machine Learning, the Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs). An important application of HW-NAS is real-time semantic segmentation, which plays a pivotal role in autonomous driving scenarios. The HW-NAS for real-time semantic segmentation inherently needs to balance multiple optimization objectives, including model accuracy, inference speed, and hardware-specific considerations. Despite its importance, benchmarks have yet to be developed to frame such a challenging task as multi-objective optimization. To bridge the gap, we introduce a tailored streamline to transform the task of HW-NAS for real-time semantic segmentation into standard MOPs. Building upon the streamline, we present a benchmark test suite, CitySeg/MOP, comprising fifteen MOPs derived from the Cityscapes dataset. The CitySeg/MOP test suite is integrated into the EvoXBench platform to provide seamless interfaces with various programming languages (e.g., Python and MATLAB) for instant fitness evaluations. We comprehensively assessed the CitySeg/MOP test suite on various multi-objective evolutionary algorithms, showcasing its versatility and practicality. Source codes are available at this https URL.
https://arxiv.org/abs/2404.16266
Neural architecture search (NAS) is a challenging problem. Hierarchical search spaces allow for cheap evaluations of neural network sub-modules to serve as surrogates for architecture evaluations. Yet, sometimes the hierarchy is too restrictive or the surrogate fails to generalize. We present FaDE, which uses differentiable architecture search to obtain relative performance predictions on finite regions of a hierarchical NAS space. The relative nature of these ranks calls for a memory-less, batch-wise outer search algorithm, for which we use an evolutionary algorithm with pseudo-gradient descent. FaDE is especially suited to deep hierarchical, i.e. multi-cell, search spaces, which it can explore at linear instead of exponential cost, and it therefore eliminates the need for a proxy search space. Our experiments show, firstly, that FaDE ranks on finite regions of the search space correlate with the corresponding architecture performances and, secondly, that these ranks can empower a pseudo-gradient evolutionary search on the complete neural architecture search space.
https://arxiv.org/abs/2404.16218
Unsupervised domain adaptation (UDA) is a challenging open problem in land cover mapping. Previous studies show encouraging progress in addressing cross-domain distribution shifts on remote sensing benchmarks for land cover mapping. The existing works are mainly built on large neural network architectures, which makes them resource-hungry systems, limiting their practical impact for many real-world applications in resource-constrained environments. Thus, we propose a simple yet effective framework to automatically search for lightweight neural networks for land cover mapping tasks under domain shifts. This is achieved by integrating Markov random field neural architecture search (MRF-NAS) into a self-training UDA framework to search for efficient and effective networks under a limited computation budget. This is the first attempt to combine NAS with self-training UDA in a single framework for land cover mapping. We also investigate two different pseudo-labelling approaches (confidence-based and energy-based) in the self-training scheme. Experimental results on two recent datasets (OpenEarthMap and FLAIR #1) for remote sensing UDA demonstrate satisfactory performance. With fewer than 2M parameters and 30.16 GFLOPs, the best-discovered lightweight network reaches state-of-the-art performance on the regional target domain of OpenEarthMap (59.38% mIoU) and the considered target domain of FLAIR #1 (51.19% mIoU). The code is at this https URL.
https://arxiv.org/abs/2404.14704
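The two pseudo-labelling rules compared above differ only in the keep-criterion: confidence-based selection thresholds the max softmax probability, while energy-based selection thresholds the free energy -logsumexp(logits) (low energy suggests an in-distribution prediction). A minimal sketch with illustrative thresholds of our choosing:

```python
import numpy as np

def pseudo_labels(logits, conf_threshold=0.9, energy_threshold=-4.0):
    """Return argmax labels plus confidence- and energy-based keep masks.

    `logits` has shape (K, N): K classes, N pixels.
    - confidence-based: keep a pixel if its max softmax probability is high;
    - energy-based: keep a pixel if -logsumexp(logits) is below a threshold.
    """
    m = logits.max(axis=0, keepdims=True)
    exp = np.exp(logits - m)
    probs = exp / exp.sum(axis=0, keepdims=True)
    labels = probs.argmax(axis=0)
    keep_conf = probs.max(axis=0) >= conf_threshold
    energy = -(m[0] + np.log(exp.sum(axis=0)))   # -logsumexp, numerically stable
    keep_energy = energy <= energy_threshold
    return labels, keep_conf, keep_energy

# Three pixels, 4 classes: one confident, one uncertain, one mildly confident.
logits = np.array([[8.0, 1.0, 2.0],
                   [0.0, 1.2, 0.1],
                   [0.0, 0.9, 0.1],
                   [0.0, 1.1, 0.1]])
labels, kc, ke = pseudo_labels(logits)
print(labels, kc, ke)
```

Only pixels passing the chosen mask contribute to the self-training loss on the target domain; the two criteria trade off differently between label noise and coverage.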
Edge machine learning (ML) enables localized processing of data on devices and is underpinned by deep neural networks (DNNs). However, DNNs cannot be easily run on devices due to their substantial computing, memory and energy requirements for delivering performance that is comparable to cloud-based ML. Therefore, model compression techniques, such as pruning, have been considered. Existing pruning methods are problematic for edge ML since they: (1) Create compressed models that have limited runtime performance benefits (using unstructured pruning) or compromise the final model accuracy (using structured pruning), and (2) Require substantial compute resources and time for identifying a suitable compressed DNN model (using neural architecture search). In this paper, we explore a new avenue, referred to as Pruning-at-Initialization (PaI), using structured pruning to mitigate the above problems. We develop Reconvene, a system for rapidly generating pruned models suited for edge deployments using structured PaI. Reconvene systematically identifies and prunes DNN convolution layers that are least sensitive to structured pruning. Reconvene rapidly creates pruned DNNs within seconds that are up to 16.21x smaller and 2x faster while maintaining the same accuracy as an unstructured PaI counterpart.
https://arxiv.org/abs/2404.16877
Recently, several approaches have successfully demonstrated that weight-sharing Neural Architecture Search (NAS) can effectively explore a search space of elastic low-rank adapters (LoRA), enabling parameter-efficient fine-tuning (PEFT) and compression of large language models. In this paper, we introduce a novel approach called Shears, demonstrating how the integration of cost-effective sparsity with a proposed Neural Low-rank adapter Search (NLS) algorithm can further improve the efficiency of PEFT approaches. Results demonstrate the benefits of Shears compared to other methods: it reaches high sparsity levels while improving accuracy or incurring only a small drop, using a single GPU for a couple of hours.
https://arxiv.org/abs/2404.10934
We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe is also introduced which improves MNv4 search effectiveness. The integration of UIB, Mobile MQA and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, as well as specialized accelerators like Apple Neural Engine and Google Pixel EdgeTPU - a characteristic not found in any other models tested. Finally, to further boost accuracy, we introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just 3.8ms.
https://arxiv.org/abs/2404.10518
Hardware-aware Neural Architecture Search approaches (HW-NAS) automate the design of deep learning architectures tailored specifically to a given target hardware platform. Yet, these techniques demand substantial computational resources, primarily due to the expensive process of assessing the performance of identified architectures. To alleviate this problem, a recent direction in the literature has employed a representation similarity metric for efficiently evaluating architecture performance. Nonetheless, since this is inherently a single-objective method, it requires multiple runs to identify the optimal architecture set satisfying diverse hardware cost constraints, thereby increasing the search cost. Furthermore, simply converting the single-objective into a multi-objective approach results in an under-explored architecture search space. In this study, we propose a multi-objective method to address the HW-NAS problem, called MO-HDNAS, which identifies the trade-off set of architectures in a single run with low computational cost. This is achieved by optimizing three objectives: maximizing the representation similarity metric, minimizing the hardware cost, and maximizing the hardware cost diversity. The third objective, hardware cost diversity, facilitates a better exploration of the architecture search space. Experimental results demonstrate the effectiveness of our proposed method in efficiently addressing the HW-NAS problem across six edge devices for the image classification task.
https://arxiv.org/abs/2404.12403
The key to the device-edge co-inference paradigm is to partition models into computation-friendly and computation-intensive parts across the device and the edge, respectively. However, for Graph Neural Networks (GNNs), we find that simply partitioning without altering their structures can hardly achieve the full potential of the co-inference paradigm, due to the various computational-communication overheads of GNN operations on heterogeneous devices. We present GCoDE, the first automatic framework for GNNs that innovatively co-designs the architecture search and the mapping of each operation onto device-edge hierarchies. GCoDE abstracts the device communication process into an explicit operation and fuses the architecture search and the operation mapping in a unified space for joint optimization. Also, the performance-aware approach used in GCoDE's constraint-based search process enables effective evaluation of architecture efficiency on diverse heterogeneous systems. We implement the co-inference engine and runtime dispatcher in GCoDE to enhance deployment efficiency. Experimental results show that GCoDE can achieve up to 44.9x speedup and 98.2% energy reduction compared to existing approaches across various applications and system configurations.
https://arxiv.org/abs/2404.05605