Federated Neural Architecture Search (FedNAS) aims to automate model design for privacy-preserving Federated Learning (FL) but currently faces two critical bottlenecks: unguided supernet training that yields suboptimal models, and costly multi-hour pipelines for post-training subnet discovery. We introduce DeepFedNAS, a novel, two-phase framework underpinned by a principled, multi-objective fitness function that synthesizes mathematical network design with architectural heuristics. Enabled by a re-engineered supernet, DeepFedNAS introduces Federated Pareto Optimal Supernet Training, which leverages a pre-computed Pareto-optimal cache of high-fitness architectures as an intelligent curriculum to optimize shared supernet weights. Subsequently, its Predictor-Free Search Method eliminates the need for costly accuracy surrogates by utilizing this fitness function as a direct, zero-cost proxy for accuracy, enabling on-demand subnet discovery in mere seconds. DeepFedNAS achieves state-of-the-art accuracy (e.g., up to 1.21% absolute improvement on CIFAR-100), superior parameter and communication efficiency, and a substantial ~61x speedup in total post-training search pipeline time. By reducing the pipeline from over 20 hours to approximately 20 minutes (including initial cache generation) and enabling 20-second individual subnet searches, DeepFedNAS makes hardware-aware FL deployments instantaneous and practical. The complete source code and experimental scripts are available at: this https URL
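The predictor-free search idea can be sketched in a few lines: rank candidate subnets by a cheap fitness score instead of a trained accuracy surrogate, and keep a Pareto-optimal cache over (fitness, parameter count). Everything below is illustrative — the paper's actual multi-objective fitness function is not reproduced, so `fitness` is a hypothetical stand-in and parameter count is approximated as depth × width.

```python
def fitness(depth: int, width: int) -> float:
    """Hypothetical zero-cost proxy rewarding balanced depth/width."""
    return depth * width / (depth + width)

def pareto_cache(candidates):
    """Cache of candidates not dominated on (max fitness, min params)."""
    scored = [(fitness(d, w), d * w, (d, w)) for d, w in candidates]
    front = []
    for f, p, arch in scored:
        dominated = any(f2 >= f and p2 <= p and (f2, p2) != (f, p)
                        for f2, p2, _ in scored)
        if not dominated:
            front.append((f, p, arch))
    return front

def search(candidates, param_budget):
    """On-demand subnet discovery: best cached fitness within budget."""
    feasible = [c for c in pareto_cache(candidates) if c[1] <= param_budget]
    return max(feasible, key=lambda c: c[0])[2] if feasible else None

subnets = [(2, 8), (4, 4), (8, 2), (6, 8)]   # (depth, width) pairs
small = search(subnets, param_budget=20)      # -> (4, 4)
large = search(subnets, param_budget=50)      # -> (6, 8)
```

Because the cache is precomputed once, each later budget-specific query is only a filter plus an argmax, which is what makes per-device searches take seconds.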
https://arxiv.org/abs/2601.15127
Unsupervised deep image prior (DIP) addresses shortcomings of training data requirements and limited generalization associated with supervised deep learning. The performance of DIP depends on the network architecture and the stopping point of its iterative process. Optimizing these parameters for a new image requires time, restricting DIP application in domains where many images need to be processed. Focusing on fluorescence microscopy data, we hypothesize that similar images share comparable optimal parameter configurations for DIP-based denoising, potentially enabling optimization-free DIP for fluorescence microscopy. We generated a calibration (n=110) and validation set (n=55) of semantically different images from an open-source dataset for a network architecture search targeted towards ideal U-net architectures and stopping points. The calibration set represented our transfer basis. The validation set enabled the assessment of which image similarity criterion yields the best results. We then implemented AUTO-DIP, a pipeline for automatic parameter transfer, and compared it to the originally published DIP configuration (baseline) and a state-of-the-art image-specific variational denoising approach. We show that a parameter transfer from the calibration dataset to a test image based on only image metadata similarity (e.g., microscope type, imaged specimen) leads to similar or better performance than a transfer based on quantitative image similarity measures. AUTO-DIP outperforms the baseline DIP (DIP with original DIP parameters) as well as the variational denoising approach for several open-source test datasets of varying complexity, particularly for very noisy inputs. Applications to locally acquired fluorescence microscopy images further demonstrated the superiority of AUTO-DIP.
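The metadata-based transfer reduces to a nearest-neighbor lookup over calibration entries. A minimal sketch, assuming a hypothetical metadata schema — the field names and calibration values below are invented for illustration, not the paper's actual dataset:

```python
# Hypothetical calibration entries: image metadata -> tuned DIP parameters
CALIBRATION = [
    {"microscope": "confocal",  "specimen": "neuron", "depth": 4, "stop": 1800},
    {"microscope": "widefield", "specimen": "cell",   "depth": 3, "stop": 1200},
    {"microscope": "confocal",  "specimen": "cell",   "depth": 5, "stop": 2400},
]

def transfer_params(meta, fields=("microscope", "specimen")):
    """Pick (U-net depth, stopping point) from the closest metadata match."""
    best = max(CALIBRATION,
               key=lambda e: sum(e[f] == meta.get(f) for f in fields))
    return best["depth"], best["stop"]

depth, stop = transfer_params({"microscope": "confocal", "specimen": "cell"})
```

The test image then gets denoised with the transferred architecture and stopping point directly, skipping any per-image optimization.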
https://arxiv.org/abs/2601.12055
Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant shifts, including multi-turn conversations, static jailbreaks, and adaptive red teaming. Our results demonstrate that while multimax addresses context length, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
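A toy sketch of why mean-pooled probes degrade on long contexts, and why a max-style aggregation (in the spirit of the multimax probe — its exact form is not specified in this abstract, so the top-k variant below is an assumption) does not: a short policy-relevant span inside a long benign context dilutes a mean but survives a top-k pool. Synthetic activations stand in for real residual-stream features.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_scores(acts, w):
    """Per-token logits from a linear probe over activations."""
    return acts @ w

def mean_pool(scores):
    return float(scores.mean())

def topk_pool(scores, k=4):
    """Max-style aggregation: average of the k highest token scores."""
    return float(np.sort(scores)[-k:].mean())

d = 16
w = rng.normal(size=d)                                   # probe direction
benign = rng.normal(scale=0.1, size=(2000, d))           # long benign context
harmful = 0.5 * w + rng.normal(scale=0.1, size=(8, d))   # short aligned span
s = probe_scores(np.vstack([benign, harmful]), w)

# the 8-token harmful span drowns in a mean over 2008 tokens,
# but dominates the top-k aggregation
```

The same dilution argument explains why short-context-trained probes transfer poorly once contexts grow by orders of magnitude.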
https://arxiv.org/abs/2601.11516
Image enhancement is a critical task in computer vision and photography that is often entangled with noise. This renders the traditional Image Signal Processing (ISP) ineffective compared to the advances in deep learning. However, the success of such methods is increasingly associated with the ease of their deployment on edge devices, such as smartphones. This work presents a novel mobile-friendly network for image de-noising, obtained with Entropy-Regularized differentiable Neural Architecture Search (NAS) on a first-of-its-kind hardware-aware search space for a U-Net architecture. The designed model has 12% fewer parameters, with a ~2-fold improvement in on-device latency and a 1.5-fold improvement in memory footprint for a 0.7% drop in PSNR, when deployed and profiled on a Samsung Galaxy S24 Ultra. Compared to the SOTA Swin-Transformer for Image Restoration, the proposed network has competitive accuracy with a ~18-fold reduction in GMACs. Further, the network was tested successfully for Gaussian de-noising at 3 intensities on 4 benchmarks and for real-world de-noising on 1 benchmark, demonstrating its generalization ability.
https://arxiv.org/abs/2601.11684
Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We address both issues by first transferring weights from the pretrained full-attention modules to their linear attention counterparts through blockwise local distillation, and second, introducing a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. This yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
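The greedy replacement strategy can be sketched as a single pass over layers with an accept/reject test on validation score. `toy_eval` is a stand-in for evaluating the hybrid on the target task's validation set, and the fixed-tolerance acceptance rule is an assumption, not necessarily the paper's exact criterion:

```python
def greedy_hybridize(n_layers, evaluate, tol=0.01):
    """Return a mask: True = linear attention at that layer."""
    mask = [False] * n_layers          # start from the full-attention model
    best = evaluate(mask)
    for i in range(n_layers):
        trial = mask.copy()
        trial[i] = True                # try linearizing layer i
        score = evaluate(trial)
        if score >= best - tol:        # keep the swap if quality holds up
            mask, best = trial, max(best, score)
    return mask

# toy check: layer 0 is "essential", the others are nearly free to linearize
def toy_eval(mask):
    return 1.0 - (0.5 if mask[0] else 0.0) - 0.001 * sum(mask[1:])

mask = greedy_hybridize(4, toy_eval)   # -> [False, True, True, True]
```

One evaluation per layer keeps the whole conversion to a single efficient pass, which is the point of avoiding full retraining or NAS.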
https://arxiv.org/abs/2601.11667
Multi-agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi-step execution and repeated model invocations, severely limiting their scalability and usability in time-sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning-based orchestration of multi-agent systems with explicit latency supervision under parallel execution. We propose Latency-Aware Multi-agent System (LAMaS), a latency-aware multi-agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38-46% compared to the state-of-the-art baseline for multi-agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi-agent systems. The code is available at this https URL
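The latency objective under parallel execution is the critical path of the execution topology graph rather than the sum of node costs. A minimal sketch of that quantity over a DAG of agent calls (node numbering assumed topological for brevity):

```python
def critical_path(latency, edges):
    """Longest path through a DAG of agent calls.

    latency[i] is agent i's runtime; edges holds (u, v) dependencies.
    Nodes are assumed numbered in topological order.
    """
    finish = [0.0] * len(latency)
    for v in range(len(latency)):
        preds = [u for u, w in edges if w == v]
        finish[v] = latency[v] + max((finish[u] for u in preds), default=0.0)
    return max(finish)

lat  = [1.0, 2.0, 3.0, 1.0]              # agent runtimes
deps = [(0, 1), (0, 2), (1, 3), (2, 3)]  # diamond: agents 1 and 2 in parallel
cp = critical_path(lat, deps)            # 1 + 3 + 1 = 5, vs. sequential sum 7
```

A controller that only minimizes total cost would treat both topologies as 7 units; supervising the critical path instead is what rewards parallel branches.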
https://arxiv.org/abs/2601.10560
Channel configuration search, the optimization of layer specifications such as layer widths in deep neural networks, presents a complex combinatorial challenge constrained by tensor shape compatibility and computational budgets. We posit that Large Language Models (LLMs) offer a transformative approach to Neural Architecture Search (NAS), capable of reasoning about architectural code structure in ways that traditional heuristics cannot. In this paper, we investigate the application of an LLM-driven NAS framework to the problem of channel configuration. We formulate the search as a sequence of conditional code generation tasks, where an LLM refines architectural specifications based on performance telemetry. Crucially, we address the data scarcity problem by generating a vast corpus of valid, shape-consistent architectures via Abstract Syntax Tree (AST) mutations. While these mutated networks are not necessarily high-performing, they provide the critical volume of structural data required for the LLM to learn the latent relationship between channel configurations and model performance. This allows the LLM to internalize complex design patterns and apply them to optimize feature extraction strategies. Experimental results on CIFAR-100 validate the efficacy of this approach, demonstrating that the model yields statistically significant improvements in accuracy. Our analysis confirms that the LLM successfully acquires domain-specific architectural priors, distinguishing this method from random search and highlighting the immense potential of language-driven design in deep learning.
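The shape-consistency requirement is what makes AST mutation attractive: a mutation can rescale a layer's output width and propagate it into the next layer's input so tensors still line up. A sketch using Python's `ast` module on an illustrative snippet — the `Conv2d(in, out, kernel)` call shape is an assumption for this example, not the paper's search space:

```python
import ast

SRC = """
layers = [
    Conv2d(3, 32, 3),
    Conv2d(32, 64, 3),
    Conv2d(64, 128, 3),
]
"""

class WidthMutator(ast.NodeTransformer):
    """Rescale each Conv2d's output width, chaining it into the next input."""
    def __init__(self, factor):
        self.factor, self.prev_out = factor, None
    def visit_Call(self, node):
        if getattr(node.func, "id", "") == "Conv2d":
            c_in, c_out, k = (a.value for a in node.args)
            if self.prev_out is not None:
                c_in = self.prev_out           # keep tensor shapes consistent
            self.prev_out = max(1, int(c_out * self.factor))
            node.args = [ast.Constant(c_in), ast.Constant(self.prev_out),
                         ast.Constant(k)]
        return node

mutated = ast.unparse(WidthMutator(0.5).visit(ast.parse(SRC)))
# halved widths, shapes still chained: 3->16, 16->32, 32->64
```

Sampling many such factor/layer mutations yields a corpus that is structurally valid by construction, even though most members are not high-performing.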
https://arxiv.org/abs/2601.08517
The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in model compression, quantization, pruning, and knowledge distillation techniques. We systematically review prominent lightweight variants including MobileBERT, TinyBERT, DistilBERT, EfficientFormer, EdgeFormer, and MobileViT, providing detailed performance benchmarks on standard datasets such as GLUE, SQuAD, ImageNet-1K, and COCO. Our analysis encompasses current industry adoption patterns across major hardware platforms (NVIDIA Jetson, Qualcomm Snapdragon, Apple Neural Engine, ARM architectures), deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML), and optimization strategies. Experimental results demonstrate that modern lightweight transformers can achieve 75-96% of full-model accuracy while reducing model size by 4-10x and inference latency by 3-9x, enabling deployment on devices with as little as 2-5W power consumption. We identify sparse attention mechanisms, mixed-precision quantization (INT8/FP16), and hardware-aware neural architecture search as the most effective optimization strategies. Novel findings include memory-bandwidth bottleneck analysis revealing 15-40M parameter models achieve optimal hardware utilization (60-75% efficiency), quantization sweet spots for different model types, and comprehensive energy efficiency profiling across edge platforms. We establish real-time performance boundaries and provide a practical 6-step deployment pipeline achieving 8-12x size reduction with less than 2% accuracy degradation.
https://arxiv.org/abs/2601.03290
Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
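Whitespace-normalized hashing is simple enough to sketch directly: collapse all whitespace runs before hashing, so two generated architecture files differing only in indentation or blank lines map to the same digest and the duplicate is never trained. The choice of SHA-256 here is an assumption; any stable digest works:

```python
import hashlib

def arch_hash(source: str) -> str:
    """Digest of the source with all whitespace runs collapsed."""
    normalized = " ".join(source.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def is_duplicate(source: str, seen: set) -> bool:
    """True if an equivalent architecture was already registered."""
    h = arch_hash(source)
    if h in seen:
        return True
    seen.add(h)
    return False

seen = set()
a = "def forward(x):\n    return conv(x)\n"
b = "def forward(x):\n\n        return conv(x)"   # same code, new whitespace
```

Hashing is a constant-time string pass, which is where the reported sub-millisecond check and speedup over full AST parsing come from.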
https://arxiv.org/abs/2512.24120
This paper addresses trash detection on the TACO dataset under strict TinyML constraints using an iterative hardware-aware neural architecture search framework targeting edge and IoT devices. The proposed method constructs a Once-for-All-style ResDets supernet and performs iterative evolutionary search that alternates between backbone and neck/head optimization, supported by a population passthrough mechanism and an accuracy predictor to reduce search cost and improve stability. This framework yields a family of deployment-ready detectors, termed TrashDets. On a five-class TACO subset (paper, plastic, bottle, can, cigarette), the strongest variant, TrashDet-l, achieves 19.5 mAP50 with 30.5M parameters, improving accuracy by up to 3.6 mAP50 over prior detectors while using substantially fewer parameters. The TrashDet family spans 1.2M to 30.5M parameters with mAP50 values between 11.4 and 19.5, providing scalable detector options for diverse TinyML deployment budgets on resource-constrained hardware. On the MAX78002 microcontroller with the TrashNet dataset, two specialized variants, TrashDet-ResNet and TrashDet-MBNet, jointly dominate the ai87-fpndetector baseline, with TrashDet-ResNet achieving 7525 μJ energy per inference at 26.7 ms latency and 37.45 FPS, and TrashDet-MBNet improving mAP50 by 10.2%; together they reduce energy consumption by up to 88%, latency by up to 78%, and average power by up to 53% compared to existing TinyML detectors.
https://arxiv.org/abs/2512.20746
Evolutionary Neural Architecture Search (ENAS) has gained attention for automatically designing neural network architectures. Recent studies use a neural predictor to guide the process, but the high computational costs of gathering training data -- since each label requires fully training an architecture -- make achieving a high-precision predictor with a limited compute budget (i.e., a capped number of fully trained architecture-label pairs) crucial for ENAS success. This paper introduces ENAS with Dual Contrastive Learning (DCL-ENAS), a novel method that employs two stages of contrastive learning to train the neural predictor. In the first stage, contrastive self-supervised learning is used to learn meaningful representations from neural architectures without requiring labels. In the second stage, fine-tuning with contrastive learning is performed to accurately predict the relative performance of different architectures rather than their absolute performance, which is sufficient to guide the evolutionary search. Across NASBench-101 and NASBench-201, DCL-ENAS achieves the highest validation accuracy, surpassing the strongest published baselines by 0.05% (ImageNet16-120) to 0.39% (NASBench-101). On a real-world ECG arrhythmia classification task, DCL-ENAS improves performance by approximately 2.5 percentage points over a manually designed, non-NAS model obtained via random search, while requiring only 7.7 GPU-days.
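The second-stage insight is that the predictor only needs to order architectures correctly, so it can be fine-tuned with a pairwise margin loss over (better, worse) architecture pairs instead of regressing absolute accuracy. A minimal sketch of such a loss — the margin value is illustrative, and the scores would come from the contrastively pretrained encoder:

```python
def pairwise_ranking_loss(scores, pairs, margin=0.1):
    """Hinge loss over (i, j) pairs where architecture i truly beats j."""
    losses = [max(0.0, margin - (scores[i] - scores[j])) for i, j in pairs]
    return sum(losses) / len(losses)

# predictor scores that order each pair correctly incur zero loss ...
good = pairwise_ranking_loss([0.9, 0.5, 0.7], pairs=[(0, 1), (2, 1)])
# ... while an inverted ranking is penalized by its margin violation
bad = pairwise_ranking_loss([0.5, 0.9], pairs=[(0, 1)])
```

Ranking supervision is cheaper to satisfy than regression, which matters when the label budget is capped at a small number of fully trained architectures.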
https://arxiv.org/abs/2512.20112
Thanks to the evolving network depth, convolutional neural networks (CNNs) have achieved remarkable success across various embedded scenarios, paving the way for ubiquitous embedded intelligence. Despite its promise, the evolving network depth comes at the cost of degraded hardware efficiency. In contrast to deep networks, shallow networks can deliver superior hardware efficiency but often suffer from inferior accuracy. To address this dilemma, we propose Double-Win NAS, a novel deep-to-shallow transformable neural architecture search (NAS) paradigm tailored for resource-constrained intelligent embedded systems. Specifically, Double-Win NAS strives to automatically explore deep networks to first win strong accuracy, which are then equivalently transformed into their shallow counterparts to further win strong hardware efficiency. In addition to search, we also propose two enhanced training techniques, including hybrid transformable training towards better training accuracy and arbitrary-resolution elastic training towards enabling natural network elasticity across arbitrary input resolutions. Extensive experimental results on two popular intelligent embedded systems (i.e., NVIDIA Jetson AGX Xavier and NVIDIA Jetson Nano) and two representative large-scale datasets (i.e., ImageNet and ImageNet-100) clearly demonstrate the superiority of Double-Win NAS over previous state-of-the-art NAS approaches.
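The deep-to-shallow transformation rests on the fact that purely linear sub-chains collapse exactly: two stacked linear maps with no nonlinearity between them equal one linear map, since w2 @ (w1 @ x + b1) + b2 = (w2 @ w1) @ x + (w2 @ b1 + b2). A minimal identity check (the paper's transform for full convolutional networks is more involved than this dense-layer case):

```python
import numpy as np

def merge_linear(w1, b1, w2, b2):
    """Collapse y = w2 @ (w1 @ x + b1) + b2 into a single linear layer."""
    return w2 @ w1, w2 @ b1 + b2

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
w2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.normal(size=4)

deep = w2 @ (w1 @ x + b1) + b2           # two-layer (deep) computation
wm, bm = merge_linear(w1, b1, w2, b2)
shallow = wm @ x + bm                     # one-layer (shallow) equivalent
```

Because the transform is exact, the accuracy won during deep-network search survives the conversion, and only the hardware profile changes.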
https://arxiv.org/abs/2512.19731
Transformer architecture search (TAS) aims to automatically discover efficient vision transformers (ViTs), reducing the need for manual design. Existing TAS methods typically train an over-parameterized network (i.e., a supernet) that encompasses all candidate architectures (i.e., subnets). However, all subnets share the same set of weights, which leads to interference that degrades the smaller subnets severely. We have found that well-trained small subnets can serve as a good foundation for training larger ones. Motivated by this, we propose a progressive training framework, dubbed GrowTAS, that begins with training small subnets and gradually incorporates larger ones. This reduces interference and stabilizes the training process. We also introduce GrowTAS+, which fine-tunes only a subset of weights to further enhance the performance of large subnets. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate the effectiveness of our approach over current TAS methods.
https://arxiv.org/abs/2512.12296
Designing high-performance object detection architectures is a complex task, where traditional manual design is time-consuming and labor-intensive, and Neural Architecture Search (NAS) is computationally prohibitive. While recent approaches using Large Language Models (LLMs) show promise, they often function as iterative optimizers within a search loop, rather than generating architectures directly from a holistic understanding of the data. To address this gap, we propose Cognitive-YOLO, a novel framework for LLM-driven architecture synthesis that generates network configurations directly from the intrinsic characteristics of the dataset. Our method consists of three stages: first, an analysis module extracts key meta-features (e.g., object scale distribution and scene density) from the target dataset; second, the LLM reasons upon these features, augmented with state-of-the-art components retrieved via Retrieval-Augmented Generation (RAG), to synthesize the architecture into a structured Neural Architecture Description Language (NADL); finally, a compiler instantiates this description into a deployable model. Extensive experiments on five diverse object detection datasets demonstrate that our proposed Cognitive-YOLO consistently generates superior architectures, achieving highly competitive performance and demonstrating a superior performance-per-parameter trade-off compared to strong baseline models across multiple benchmarks. Crucially, our ablation studies prove that the LLM's data-driven reasoning is the primary driver of performance, demonstrating that a deep understanding of data "first principles" is more critical for achieving a superior architecture than simply retrieving SOTA components.
https://arxiv.org/abs/2512.12281
Driver fatigue remains a leading cause of road accidents, with 24% of crashes involving drowsy drivers. While yawning serves as an early behavioral indicator of fatigue, existing machine learning approaches face significant challenges due to video-annotated datasets that introduce systematic noise from coarse temporal annotations. We develop a semi-automated labeling pipeline with human-in-the-loop verification, which we apply to YawDD, enabling more accurate model training. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP. The resulting approach delivers up to 59.8 FPS on edge AI hardware (NVIDIA Jetson Nano), confirming that enhanced data quality alone supports on-device yawning monitoring without server-side computation.
https://arxiv.org/abs/2512.11446
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: this https URL
https://arxiv.org/abs/2512.11130
Early-exit networks are effective solutions for reducing the overall energy consumption and latency of deep learning models by adjusting computation based on the complexity of input data. By incorporating intermediate exit branches into the architecture, they provide less computation for simpler samples, which is particularly beneficial for resource-constrained devices where energy consumption is crucial. However, designing early-exit networks is a challenging and time-consuming process due to the need to balance efficiency and performance. Recent works have utilized Neural Architecture Search (NAS) to design more efficient early-exit networks, aiming to reduce average latency while improving model accuracy by determining the best positions and number of exit branches in the architecture. Another important factor affecting the efficiency and accuracy of early-exit networks is the depth and types of layers in the exit branches. In this paper, we use hardware-aware NAS to strengthen exit branches, considering both accuracy and efficiency during optimization. Our performance evaluation on the CIFAR-10, CIFAR-100, and SVHN datasets demonstrates that our proposed framework, which considers varying depths and layers for exit branches along with adaptive threshold tuning, designs early-exit networks that achieve higher accuracy with the same or lower average number of MACs compared to the state-of-the-art approaches.
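Early-exit inference itself is easy to sketch: evaluate exit branches in order and stop at the first whose softmax confidence clears its threshold, so easy inputs pay for fewer MACs. Branch outputs are stubbed with raw logits below, and the thresholds are illustrative rather than the adaptively tuned ones the framework searches for:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_predict(exit_logits, thresholds):
    """Return (predicted class, index of the exit branch that fired)."""
    for k, logits in enumerate(exit_logits):
        p = softmax(logits)
        # fire when confident enough, or unconditionally at the final exit
        if p.max() >= thresholds[k] or k == len(exit_logits) - 1:
            return int(p.argmax()), k

easy = [np.array([4.0, 0.0, 0.0]), np.array([0.0, 5.0, 0.0])]  # confident early
hard = [np.array([1.0, 0.9, 0.8]), np.array([0.0, 5.0, 0.0])]  # needs final exit
th = [0.9, 0.0]
```

The NAS component in the paper then searches over the depth and layer types of each branch jointly with these thresholds, trading accuracy against the expected MAC count of this loop.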
https://arxiv.org/abs/2512.10671
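The early-exit inference scheme described above can be sketched as follows. The branch models and per-exit thresholds are placeholders, and the adaptive threshold-tuning procedure itself is not specified in the abstract:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_predict(x, exit_branches, thresholds):
    """Run exit branches in order; return (predicted class, exit index) at
    the first branch whose max softmax confidence clears its threshold.
    `exit_branches` are callables mapping an input to class logits; simpler
    samples exit early, skipping the deeper (more expensive) branches."""
    probs = None
    for i, (branch, tau) in enumerate(zip(exit_branches, thresholds)):
        probs = softmax(branch(x))
        if max(probs) >= tau:
            return probs.index(max(probs)), i
    # No branch was confident enough: fall back to the deepest branch.
    return probs.index(max(probs)), len(exit_branches) - 1
```

Average MACs then depend on how often inputs stop at shallow exits, which is what threshold tuning trades off against accuracy.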
Quantum circuit design is a key bottleneck for practical quantum machine learning on complex, real-world data. We present an automated framework that discovers and refines variational quantum circuits (VQCs) using graph-based Bayesian optimization with a graph neural network (GNN) surrogate. Circuits are represented as graphs, then mutated and selected via an expected-improvement acquisition function informed by surrogate uncertainty estimated with Monte Carlo dropout. Candidate circuits are evaluated with a hybrid quantum-classical variational classifier on the next-generation firewall telemetry and network Internet-of-Things (NF-ToN-IoT-V2) cybersecurity dataset, after feature selection and scaling for quantum embedding. We benchmark our pipeline against an MLP-based surrogate, random search, and greedy GNN selection. The GNN-guided optimizer consistently finds circuits with lower complexity and competitive or superior classification accuracy compared to all baselines. Robustness is assessed via a noise study across standard quantum noise channels, including amplitude damping, phase damping, thermal relaxation, depolarizing, and readout bit-flip noise. The implementation is fully reproducible, with time benchmarking and export of the best found circuits, providing a scalable and interpretable route to automated quantum circuit discovery.
https://arxiv.org/abs/2512.09586
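The acquisition step described above can be illustrated with a minimal sketch: Monte Carlo dropout yields a mean and uncertainty estimate per candidate circuit, which feed an expected-improvement score. The exploration parameter `xi` is an assumed default, not taken from the paper:

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def mc_dropout_stats(predictions):
    """Mean/std over T stochastic forward passes (Monte Carlo dropout),
    used as the surrogate's uncertainty estimate for a candidate."""
    t = len(predictions)
    mu = sum(predictions) / t
    var = sum((p - mu) ** 2 for p in predictions) / t
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for maximization: expected amount by which a candidate with
    surrogate mean `mu` and uncertainty `sigma` beats incumbent `best`."""
    if sigma <= 0:
        return max(0.0, mu - best - xi)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * normal_cdf(z) + sigma * normal_pdf(z)
```

Candidates with high EI are the ones mutated and evaluated next; uncertain candidates retain nonzero EI even when their mean is below the incumbent, which is what drives exploration.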
Neural decoding, a critical component of Brain-Computer Interfaces (BCIs), has recently attracted increasing research interest. Previous research has focused on leveraging signal processing and deep learning methods to enhance neural decoding performance. However, model architectures themselves remain underexplored for this task, despite the proven effectiveness of architecture design in other domains such as energy forecasting and image classification. In this study, we propose NeuroSketch, an effective framework for neural decoding via systematic architecture optimization. Starting from a study of basic architectures, we find that CNN-2D outperforms the alternatives on neural decoding tasks and analyze its effectiveness from temporal and spatial perspectives. Building on this, we optimize the architecture from the macro to the micro level, achieving performance improvements at each step. The exploration process and model validation comprise over 5,000 experiments spanning three distinct modalities (visual, auditory, and speech), three types of brain signals (EEG, SEEG, and ECoG), and eight diverse decoding tasks. Experimental results indicate that NeuroSketch achieves state-of-the-art (SOTA) performance across all evaluated datasets, positioning it as a powerful tool for neural decoding. Our code and scripts are available at this https URL.
https://arxiv.org/abs/2512.09524
Our work introduces the DermETAS-SNA LLM Assistant, which integrates Dermatology-focused Evolutionary Transformer Architecture Search (ETAS) with a StackNet-Augmented LLM. The assistant dynamically learns skin-disease classifiers and provides medically informed descriptions to facilitate clinician-patient interpretation. Contributions include: (1) developed an ETAS framework on the SKINCON dataset to optimize a Vision Transformer (ViT) tailored for dermatological feature representation, then fine-tuned binary classifiers for each of the 23 skin-disease categories in the DermNet dataset to enhance classification performance; (2) designed a StackNet architecture that integrates multiple fine-tuned binary ViT classifiers to enhance predictive robustness and mitigate class-imbalance issues; (3) implemented a RAG pipeline, termed the Diagnostic Explanation and Retrieval Model for Dermatology, which harnesses the Google Gemini 2.5 Pro LLM to generate personalized, contextually informed diagnostic descriptions and explanations for patients, leveraging a repository of verified dermatological materials; (4) performed extensive experimental evaluations on the 23 skin-disease categories, achieving an overall F1-score of 56.30% that surpasses SkinGPT-4 (48.51%) by a considerable margin, a relative improvement of 16.06%; (5) conducted a domain-expert evaluation, with eight licensed medical doctors, of the clinical responses generated by our AI assistant for seven dermatological conditions, showing a 92% agreement rate with the assistant's assessments; and (6) created a proof-of-concept prototype that fully integrates our DermETAS-SNA LLM into our AI assistant to demonstrate its practical feasibility for real-world clinical and educational applications.
https://arxiv.org/abs/2512.08998
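Contribution (2) combines per-class binary ViT classifiers. The exact StackNet aggregation rule is not given in the abstract, so the following is a minimal one-vs-rest sketch with hypothetical class labels and classifier callables:

```python
def stacknet_predict(x, binary_classifiers):
    """One-vs-rest aggregation: each per-class binary classifier scores
    P(class | x); the class with the highest score wins. A full StackNet
    would instead feed these scores as features to a second-level
    meta-learner rather than taking a plain argmax."""
    scores = {label: clf(x) for label, clf in binary_classifiers.items()}
    return max(scores, key=scores.get), scores
```

Training one binary classifier per category also lets each be fine-tuned and rebalanced independently, which is how this design mitigates class imbalance.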