Channel configuration search, the optimization of layer specifications such as layer widths in deep neural networks, presents a complex combinatorial challenge constrained by tensor shape compatibility and computational budgets. We posit that Large Language Models (LLMs) offer a transformative approach to Neural Architecture Search (NAS), capable of reasoning about architectural code structure in ways that traditional heuristics cannot. In this paper, we investigate the application of an LLM-driven NAS framework to the problem of channel configuration. We formulate the search as a sequence of conditional code generation tasks, where an LLM refines architectural specifications based on performance telemetry. Crucially, we address the data scarcity problem by generating a vast corpus of valid, shape-consistent architectures via Abstract Syntax Tree (AST) mutations. While these mutated networks are not necessarily high-performing, they provide the critical volume of structural data required for the LLM to learn the latent relationship between channel configurations and model performance. This allows the LLM to internalize complex design patterns and apply them to optimize feature extraction strategies. Experimental results on CIFAR-100 validate the efficacy of this approach, demonstrating that the model yields statistically significant improvements in accuracy. Our analysis confirms that the LLM successfully acquires domain-specific architectural priors, distinguishing this method from random search and highlighting the immense potential of language-driven design in deep learning.
https://arxiv.org/abs/2601.08517
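The AST-mutation idea can be sketched in a few lines of Python: parse an architecture specification, perturb each layer's output width, and propagate the new width into the next layer's input so the result stays shape-consistent. The `Conv2d(in, out, kernel)` specification format and the mutation factors below are illustrative assumptions, not the paper's actual implementation.

```python
import ast
import random

# Toy architecture specification; the real search would operate on full model code.
SOURCE = """
layers = [
    Conv2d(3, 32, 3),
    Conv2d(32, 64, 3),
    Conv2d(64, 128, 3),
]
"""

class ChannelMutator(ast.NodeTransformer):
    """Rescale each Conv2d's out_channels, then force the next layer's
    in_channels to match so the mutated network stays shape-consistent."""

    def __init__(self, factors=(0.5, 1.0, 2.0)):
        self.factors = factors
        self.prev_out = None  # out_channels of the previously visited Conv2d

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == "Conv2d":
            in_c, out_c, k = (arg.value for arg in node.args)
            if self.prev_out is not None:
                in_c = self.prev_out  # repair the shape contract with the previous layer
            new_out = max(1, int(out_c * random.choice(self.factors)))
            self.prev_out = new_out
            node.args = [ast.Constant(in_c), ast.Constant(new_out), ast.Constant(k)]
        return node

random.seed(0)
mutated = ast.unparse(ChannelMutator().visit(ast.parse(SOURCE)))
print(mutated)
```

Every mutant produced this way is syntactically valid and shape-consistent by construction, which is what makes bulk generation of training data cheap even though most mutants are mediocre networks.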
The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in model compression, quantization, pruning, and knowledge distillation techniques. We systematically review prominent lightweight variants including MobileBERT, TinyBERT, DistilBERT, EfficientFormer, EdgeFormer, and MobileViT, providing detailed performance benchmarks on standard datasets such as GLUE, SQuAD, ImageNet-1K, and COCO. Our analysis encompasses current industry adoption patterns across major hardware platforms (NVIDIA Jetson, Qualcomm Snapdragon, Apple Neural Engine, ARM architectures), deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML), and optimization strategies. Experimental results demonstrate that modern lightweight transformers can achieve 75-96% of full-model accuracy while reducing model size by 4-10x and inference latency by 3-9x, enabling deployment on devices with as little as 2-5W power consumption. We identify sparse attention mechanisms, mixed-precision quantization (INT8/FP16), and hardware-aware neural architecture search as the most effective optimization strategies. Novel findings include memory-bandwidth bottleneck analysis revealing that models with 15-40M parameters achieve optimal hardware utilization (60-75% efficiency), quantization sweet spots for different model types, and comprehensive energy efficiency profiling across edge platforms. We establish real-time performance boundaries and provide a practical 6-step deployment pipeline achieving 8-12x size reduction with less than 2% accuracy degradation.
https://arxiv.org/abs/2601.03290
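To make the INT8 quantization trade-off cited above concrete, here is a minimal, framework-free sketch of symmetric per-tensor INT8 quantization, the scheme commonly used for weights: storing one byte per weight instead of a 4-byte float is where roughly a 4x size reduction comes from. This is a generic illustration, not code from any surveyed framework.

```python
import random

def quantize_int8(xs):
    """Symmetric per-tensor INT8: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(x) for x in xs)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(1000)]  # stand-in for a weight tensor
q, scale = quantize_int8(weights)
recon = dequantize_int8(q, scale)
mse = sum((w - r) ** 2 for w, r in zip(weights, recon)) / len(weights)
# int8 storage: 1 byte/weight vs 4 bytes for float32 -> 4x smaller
print(f"scale={scale:.6f}, reconstruction mse={mse:.3e}")
```

Production toolchains (TensorFlow Lite, ONNX Runtime, etc.) add per-channel scales, zero points for asymmetric ranges, and calibration over activation statistics, but the size/error trade-off is the same in spirit.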
Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
https://arxiv.org/abs/2512.24120
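Whitespace-Normalized Hash Validation as described lends itself to a direct sketch: collapse whitespace, hash, and reject repeats. The helper names below are hypothetical and the paper's exact normalization rules may differ.

```python
import hashlib

def arch_fingerprint(code: str) -> str:
    """Collapse all runs of whitespace so formatting-only variants hash identically."""
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(code: str) -> bool:
    """O(1) dedup check against previously accepted architectures."""
    fp = arch_fingerprint(code)
    if fp in seen:
        return True
    seen.add(fp)
    return False

a = "class Net:\n    def __init__(self):\n        self.fc = Linear(10, 2)"
b = "class Net:\n\tdef __init__(self):\n\t\tself.fc = Linear(10, 2)"  # retabbed variant
print(is_duplicate(a), is_duplicate(b))  # the reformatted copy is caught
```

Note the deliberate trade-off: since whitespace is semantically meaningful in Python, this treats all indentation variants as duplicates; accepting that coarseness is what buys the claimed ~100x speedup over full AST parsing.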
This paper addresses trash detection on the TACO dataset under strict TinyML constraints using an iterative hardware-aware neural architecture search framework targeting edge and IoT devices. The proposed method constructs a Once-for-All-style ResDets supernet and performs iterative evolutionary search that alternates between backbone and neck/head optimization, supported by a population passthrough mechanism and an accuracy predictor to reduce search cost and improve stability. This framework yields a family of deployment-ready detectors, termed TrashDets. On a five-class TACO subset (paper, plastic, bottle, can, cigarette), the strongest variant, TrashDet-l, achieves 19.5 mAP50 with 30.5M parameters, improving accuracy by up to 3.6 mAP50 over prior detectors while using substantially fewer parameters. The TrashDet family spans 1.2M to 30.5M parameters with mAP50 values between 11.4 and 19.5, providing scalable detector options for diverse TinyML deployment budgets on resource-constrained hardware. On the MAX78002 microcontroller with the TrashNet dataset, two specialized variants, TrashDet-ResNet and TrashDet-MBNet, jointly dominate the ai87-fpndetector baseline, with TrashDet-ResNet achieving 7525 μJ energy per inference at 26.7 ms latency and 37.45 FPS, and TrashDet-MBNet improving mAP50 by 10.2%; together they reduce energy consumption by up to 88%, latency by up to 78%, and average power by up to 53% compared to existing TinyML detectors.
https://arxiv.org/abs/2512.20746
Evolutionary Neural Architecture Search (ENAS) has gained attention for automatically designing neural network architectures. Recent studies use a neural predictor to guide the process, but the high computational costs of gathering training data -- since each label requires fully training an architecture -- make achieving a high-precision predictor with a limited compute budget (i.e., a capped number of fully trained architecture-label pairs) crucial for ENAS success. This paper introduces ENAS with Dual Contrastive Learning (DCL-ENAS), a novel method that employs two stages of contrastive learning to train the neural predictor. In the first stage, contrastive self-supervised learning is used to learn meaningful representations from neural architectures without requiring labels. In the second stage, fine-tuning with contrastive learning is performed to accurately predict the relative performance of different architectures rather than their absolute performance, which is sufficient to guide the evolutionary search. Across NASBench-101 and NASBench-201, DCL-ENAS achieves the highest validation accuracy, surpassing the strongest published baselines by 0.05% (ImageNet16-120) to 0.39% (NASBench-101). On a real-world ECG arrhythmia classification task, DCL-ENAS improves performance by approximately 2.5 percentage points over a manually designed, non-NAS model obtained via random search, while requiring only 7.7 GPU-days.
https://arxiv.org/abs/2512.20112
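The second-stage idea (predicting relative rather than absolute performance) can be illustrated with a pairwise margin ranking loss, where the predictor is penalized only when it orders a pair of architectures incorrectly or by too small a margin. This is a standard formulation, not necessarily the paper's exact contrastive objective.

```python
def pairwise_ranking_loss(scores, accuracies, margin=0.1):
    """Hinge loss over all ordered pairs: the architecture with higher true
    accuracy should receive a predicted score higher by at least `margin`."""
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if accuracies[i] > accuracies[j]:
                loss += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return loss / max(pairs, 1)

# Predictor "good" ranks three architectures correctly; "bad" inverts one pair.
acc = [0.70, 0.80, 0.90]
good = [0.1, 0.5, 0.9]   # correct order with wide margins -> zero loss
bad = [0.5, 0.1, 0.9]    # swaps the first two -> penalized
print(pairwise_ranking_loss(good, acc), pairwise_ranking_loss(bad, acc))
```

Because evolutionary search only ever compares candidates, a predictor trained this way can guide selection even when its absolute accuracy estimates are poorly calibrated.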
Thanks to the evolving network depth, convolutional neural networks (CNNs) have achieved remarkable success across various embedded scenarios, paving the way for ubiquitous embedded intelligence. Despite its promise, the evolving network depth comes at the cost of degraded hardware efficiency. In contrast to deep networks, shallow networks can deliver superior hardware efficiency but often suffer from inferior accuracy. To address this dilemma, we propose Double-Win NAS, a novel deep-to-shallow transformable neural architecture search (NAS) paradigm tailored for resource-constrained intelligent embedded systems. Specifically, Double-Win NAS strives to automatically explore deep networks to first win strong accuracy, which are then equivalently transformed into their shallow counterparts to further win strong hardware efficiency. In addition to search, we also propose two enhanced training techniques, including hybrid transformable training towards better training accuracy and arbitrary-resolution elastic training towards enabling natural network elasticity across arbitrary input resolutions. Extensive experimental results on two popular intelligent embedded systems (i.e., NVIDIA Jetson AGX Xavier and NVIDIA Jetson Nano) and two representative large-scale datasets (i.e., ImageNet and ImageNet-100) clearly demonstrate the superiority of Double-Win NAS over previous state-of-the-art NAS approaches.
https://arxiv.org/abs/2512.19731
Transformer architecture search (TAS) aims to automatically discover efficient vision transformers (ViTs), reducing the need for manual design. Existing TAS methods typically train an over-parameterized network (i.e., a supernet) that encompasses all candidate architectures (i.e., subnets). However, all subnets share the same set of weights, which leads to interference that severely degrades the smaller subnets. We have found that well-trained small subnets can serve as a good foundation for training larger ones. Motivated by this, we propose a progressive training framework, dubbed GrowTAS, that begins with training small subnets and gradually incorporates larger ones. This reduces interference and stabilizes the training process. We also introduce GrowTAS+, which fine-tunes only a subset of weights to further enhance the performance of large subnets. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate the effectiveness of our approach over current TAS methods.
https://arxiv.org/abs/2512.12296
Designing high-performance object detection architectures is a complex task, where traditional manual design is time-consuming and labor-intensive, and Neural Architecture Search (NAS) is computationally prohibitive. While recent approaches using Large Language Models (LLMs) show promise, they often function as iterative optimizers within a search loop, rather than generating architectures directly from a holistic understanding of the data. To address this gap, we propose Cognitive-YOLO, a novel framework for LLM-driven architecture synthesis that generates network configurations directly from the intrinsic characteristics of the dataset. Our method consists of three stages: first, an analysis module extracts key meta-features (e.g., object scale distribution and scene density) from the target dataset; second, the LLM reasons upon these features, augmented with state-of-the-art components retrieved via Retrieval-Augmented Generation (RAG), to synthesize the architecture into a structured Neural Architecture Description Language (NADL); finally, a compiler instantiates this description into a deployable model. Extensive experiments on five diverse object detection datasets demonstrate that our proposed Cognitive-YOLO consistently generates superior architectures, achieving highly competitive performance and demonstrating a superior performance-per-parameter trade-off compared to strong baseline models across multiple benchmarks. Crucially, our ablation studies prove that the LLM's data-driven reasoning is the primary driver of performance, demonstrating that a deep understanding of data "first principles" is more critical for achieving a superior architecture than simply retrieving SOTA components.
https://arxiv.org/abs/2512.12281
Driver fatigue remains a leading cause of road accidents, with 24% of crashes involving drowsy drivers. While yawning serves as an early behavioral indicator of fatigue, existing machine learning approaches face significant challenges due to video-annotated datasets that introduce systematic noise from coarse temporal annotations. We develop a semi-automated labeling pipeline with human-in-the-loop verification, which we apply to YawDD to produce YawDD+, enabling more accurate model training. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP. The resulting approach delivers up to 59.8 FPS on edge AI hardware (NVIDIA Jetson Nano), confirming that enhanced data quality alone supports on-device yawning monitoring without server-side computation.
https://arxiv.org/abs/2512.11446
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: this https URL
https://arxiv.org/abs/2512.11130
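The knowledge-distillation component can be illustrated with the standard Hinton-style loss that blends hard-label cross-entropy with a temperature-softened KL term toward the teacher; the exact objective used to compress the hybrid backbone in this paper may differ.

```python
import math

def softmax(logits, temperature=1.0):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with T^2-scaled KL to the teacher's soft targets."""
    ce = -math.log(softmax(student_logits)[label])
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s))
    return alpha * ce + (1.0 - alpha) * T * T * kl

teacher = [4.0, 1.0, -2.0]
aligned = [3.5, 0.8, -1.9]    # student roughly matches the teacher
clueless = [0.0, 0.0, 0.0]    # uniform student
print(kd_loss(aligned, teacher, label=0), kd_loss(clueless, teacher, label=0))
```

The temperature spreads probability mass over non-target classes so the student also learns the teacher's "dark knowledge" about class similarity, which is what lets a much smaller student approach the teacher's accuracy.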
Early-exit networks are effective solutions for reducing the overall energy consumption and latency of deep learning models by adjusting computation based on the complexity of input data. By incorporating intermediate exit branches into the architecture, they provide less computation for simpler samples, which is particularly beneficial for resource-constrained devices where energy consumption is crucial. However, designing early-exit networks is a challenging and time-consuming process due to the need to balance efficiency and performance. Recent works have utilized Neural Architecture Search (NAS) to design more efficient early-exit networks, aiming to reduce average latency while improving model accuracy by determining the best positions and number of exit branches in the architecture. Another important factor affecting the efficiency and accuracy of early-exit networks is the depth and types of layers in the exit branches. In this paper, we use hardware-aware NAS to strengthen exit branches, considering both accuracy and efficiency during optimization. Our performance evaluation on the CIFAR-10, CIFAR-100, and SVHN datasets demonstrates that our proposed framework, which considers varying depths and layers for exit branches along with adaptive threshold tuning, designs early-exit networks that achieve higher accuracy with the same or lower average number of MACs compared to the state-of-the-art approaches.
https://arxiv.org/abs/2512.10671
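The early-exit mechanism this work builds on is easy to sketch: attach a classifier head at several depths and stop at the first head whose softmax confidence clears a threshold (the adaptive threshold tuning the paper describes would adjust that threshold per exit). The toy heads below are stand-ins for real exit branches.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_infer(x, exit_heads, threshold=0.9):
    """Run exit branches in order; stop at the first sufficiently confident one."""
    for depth, head in enumerate(exit_heads, start=1):
        probs = softmax(head(x))
        conf = max(probs)
        if conf >= threshold:
            return probs.index(conf), depth   # (predicted class, exits evaluated)
    return probs.index(conf), depth           # fall through to the final exit

# Toy heads: deeper heads produce sharper (more confident) logits on the same input.
heads = [
    lambda x: [0.2, 0.3],
    lambda x: [0.1, 1.5],
    lambda x: [0.0, 6.0],
]
print(early_exit_infer("img", heads, threshold=0.9))
```

Lowering the threshold trades accuracy for compute: with `threshold=0.7` the same input exits one branch earlier, which is exactly the accuracy/MACs dial the NAS framework optimizes.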
Quantum circuit design is a key bottleneck for practical quantum machine learning on complex, real-world data. We present an automated framework that discovers and refines variational quantum circuits (VQCs) using graph-based Bayesian optimization with a graph neural network (GNN) surrogate. Circuits are represented as graphs and mutated and selected via an expected improvement acquisition function informed by surrogate uncertainty estimated with Monte Carlo dropout. Candidate circuits are evaluated with a hybrid quantum-classical variational classifier on the next generation firewall telemetry and network internet of things (NF-ToN-IoT-V2) cybersecurity dataset, after feature selection and scaling for quantum embedding. We benchmark our pipeline against an MLP-based surrogate, random search, and greedy GNN selection. The GNN-guided optimizer consistently finds circuits with lower complexity and competitive or superior classification accuracy compared to all baselines. Robustness is assessed via a noise study across standard quantum noise channels, including amplitude damping, phase damping, thermal relaxation, depolarizing, and readout bit flip noise. The implementation is fully reproducible, with time benchmarking and export of best found circuits, providing a scalable and interpretable route to automated quantum circuit discovery.
https://arxiv.org/abs/2512.09586
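The acquisition step (expected improvement under MC-dropout uncertainty) has a standard closed form, sketched below. The `xi` exploration offset and the helper names are illustrative; the stochastic predictor here is simulated with Gaussian jitter rather than a real GNN with dropout.

```python
import math
import random

def expected_improvement(mu, sigma, best, xi=0.01):
    """Closed-form EI (maximization) under a Gaussian posterior N(mu, sigma^2)."""
    if sigma < 1e-12:
        return max(0.0, mu - best - xi)
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best - xi) * cdf + sigma * pdf

def mc_dropout_posterior(stochastic_predict, candidate, n=50):
    """Mean/std over stochastic forward passes approximate the surrogate posterior."""
    samples = [stochastic_predict(candidate) for _ in range(n)]
    mu = sum(samples) / n
    var = sum((s - mu) ** 2 for s in samples) / n
    return mu, math.sqrt(var)

random.seed(0)
noisy_gnn = lambda circuit: 0.85 + random.gauss(0.0, 0.02)  # dropout noise stand-in
mu, sigma = mc_dropout_posterior(noisy_gnn, "candidate-circuit")
print(expected_improvement(mu, sigma, best=0.86))
```

EI rewards both a high predicted mean and high surrogate uncertainty, so mutated circuits the GNN is unsure about still get explored rather than greedily discarded.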
Neural decoding, a critical component of Brain-Computer Interface (BCI), has recently attracted increasing research interest. Previous research has focused on leveraging signal processing and deep learning methods to enhance neural decoding performance. However, model architectures themselves remain underexplored, despite architecture design's proven effectiveness in other tasks such as energy forecasting and image classification. In this study, we propose NeuroSketch, an effective framework for neural decoding via systematic architecture optimization. Starting with the basic architecture study, we find that CNN-2D outperforms other architectures in neural decoding tasks and explore its effectiveness from temporal and spatial perspectives. Building on this, we optimize the architecture from macro- to micro-level, achieving improvements in performance at each step. The exploration process and model validations comprise over 5,000 experiments spanning three distinct modalities (visual, auditory, and speech), three types of brain signals (EEG, SEEG, and ECoG), and eight diverse decoding tasks. Experimental results indicate that NeuroSketch achieves state-of-the-art (SOTA) performance across all evaluated datasets, positioning it as a powerful tool for neural decoding. Our code and scripts are available at this https URL.
https://arxiv.org/abs/2512.09524
Our work introduces the DermETAS-SNA LLM Assistant that integrates Dermatology-focused Evolutionary Transformer Architecture Search with StackNet Augmented LLM. The assistant dynamically learns skin-disease classifiers and provides medically informed descriptions to facilitate clinician-patient interpretation. Contributions include: (1) Developed an ETAS framework on the SKINCON dataset to optimize a Vision Transformer (ViT) tailored for dermatological feature representation and then fine-tuned binary classifiers for each of the 23 skin disease categories in the DermNet dataset to enhance classification performance; (2) Designed a StackNet architecture that integrates multiple fine-tuned binary ViT classifiers to enhance predictive robustness and mitigate class imbalance issues; (3) Implemented a RAG pipeline, termed Diagnostic Explanation and Retrieval Model for Dermatology, which harnesses the capabilities of the Google Gemini 2.5 Pro LLM architecture to generate personalized, contextually informed diagnostic descriptions and explanations for patients, leveraging a repository of verified dermatological materials; (4) Performed extensive experimental evaluations on 23 skin disease categories to demonstrate performance increase, achieving an overall F1-score of 56.30% that surpasses SkinGPT-4 (48.51%) by a considerable margin, representing a relative performance increase of 16.06%; (5) Conducted a domain-expert evaluation, with eight licensed medical doctors, of the clinical responses generated by our AI assistant for seven dermatological conditions, showing a 92% rate of doctor agreement with the assessments provided by our AI assistant; (6) Created a proof-of-concept prototype that fully integrates our DermETAS-SNA LLM into our AI assistant to demonstrate its practical feasibility for real-world clinical and educational applications.
https://arxiv.org/abs/2512.08998
Neural architecture search (NAS) traditionally requires significant human expertise or automated trial-and-error to design deep learning models. We present NN-Caption, an LLM-guided neural architecture search pipeline that generates runnable image-captioning models by composing CNN encoders from LEMUR's classification backbones with sequence decoders (LSTM/GRU/Transformer) under a strict Net API. Using DeepSeek-R1-0528-Qwen3-8B as the primary generator, we present the prompt template and examples of generated architectures. We evaluate on MS COCO with BLEU-4. The LLM generated dozens of captioning models, with over half successfully trained and producing meaningful captions. We analyse the outcomes of using different numbers of input model snippets (5 vs. 10) in the prompt, finding a slight drop in success rate when providing more candidate components. We also report training dynamics (caption accuracy vs. epochs) and the highest BLEU-4 attained. Our results highlight the promise of LLM-guided NAS: the LLM not only proposes architectures but also suggests hyperparameters and training practices. We identify the challenges encountered (e.g., code hallucinations or API compliance issues) and detail how prompt rules and iterative code fixes addressed them. This work presents a pipeline that integrates prompt-based code generation with automatic evaluation, and adds dozens of novel captioning models to the open LEMUR dataset to facilitate reproducible benchmarking and downstream AutoML research.
https://arxiv.org/abs/2512.14706
Advancements in high-performance computing and cloud technologies have enabled the development of increasingly sophisticated Deep Learning (DL) models. However, the growing demand for embedded intelligence at the edge imposes stringent computational and energy constraints, challenging the deployment of these large-scale models. Early Exiting Neural Networks (EENN) have emerged as a promising solution, allowing dynamic termination of inference based on input complexity to enhance efficiency. Despite their potential, EENN performance is highly influenced by the heterogeneity of edge accelerators and the constraints imposed by quantization, affecting accuracy, energy efficiency, and latency. Yet, research on the automatic optimization of EENN design for edge hardware remains limited. To bridge this gap, we propose a hardware-aware Neural Architecture Search (NAS) framework that systematically integrates the effects of quantization and hardware resource allocation to optimize the placement of early exit points within a network backbone. Experimental results on the CIFAR-10 dataset demonstrate that our NAS framework can discover architectures that achieve over a 50% reduction in computational costs compared to conventional static networks, making them more suitable for deployment in resource-constrained edge environments.
https://arxiv.org/abs/2512.04705
The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhance SAM's adaptation performance on diverse domains. Despite advancements, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant since the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Experiments demonstrate that NAS-LoRA improves on existing PEFT methods while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.
https://arxiv.org/abs/2512.03499
Foundation models have revolutionized fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer their success to biology, existing work focuses on directly adopting foundation model architectures from general machine learning domains without a systematic design that considers the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying "grammars" inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best use of the discovered architectures, we propose and compare several architecture prediction methods that efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.
https://arxiv.org/abs/2512.00283
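One simple form an "architecture prediction method" for new tasks could take is a nearest-neighbour lookup over previously searched tasks. The task names, feature vectors, and architecture labels below are entirely hypothetical stand-ins, not BioArc's actual data or predictors:

```python
import numpy as np

# Hypothetical record of NAS outcomes: per-task feature vectors
# (e.g. typical sequence length, alphabet size, sparsity) mapped to
# the best architecture found for that task.
SEARCH_RESULTS = {
    "dna_promoters": (np.array([1000.0, 4.0, 0.1]), "conv-attn-hybrid"),
    "protein_fold":  (np.array([500.0, 20.0, 0.4]), "deep-transformer"),
    "rna_structure": (np.array([200.0, 4.0, 0.7]), "sparse-mixer"),
}

def predict_architecture(task_features):
    """1-nearest-neighbour prediction: return the best-known architecture
    of the most similar previously searched task."""
    feats = np.stack([f for f, _ in SEARCH_RESULTS.values()])
    # Standardise each feature dimension so scale differences don't dominate.
    mu, sd = feats.mean(0), feats.std(0) + 1e-9
    q = (np.asarray(task_features) - mu) / sd
    best, best_d = None, np.inf
    for f, arch in SEARCH_RESULTS.values():
        d = np.linalg.norm((f - mu) / sd - q)
        if d < best_d:
            best, best_d = arch, d
    return best
```

The paper compares several such predictors; this sketch only illustrates the interface — task descriptors in, recommended architecture out — that lets the distilled search results transfer to unseen biological tasks without a fresh NAS run.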
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2's training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3x larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, this http URL, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.
https://arxiv.org/abs/2511.23404
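The "gated short convolution" building block of the LFM2 backbone can be sketched as a short causal convolution over the sequence, modulated elementwise by a learned sigmoid gate. Shapes, initialization, and the exact gating form below are illustrative assumptions, not LFM2's actual operator:

```python
import numpy as np

def gated_short_conv(x, kernel, gate_w):
    """Minimal sketch: a short causal convolution over the sequence axis,
    gated elementwise by sigmoid(x @ gate_w).
    x: (seq_len, dim); kernel: (k, dim); gate_w: (dim, dim)."""
    k = kernel.shape[0]
    seq_len, dim = x.shape
    # Causal short convolution: each position mixes only its last k inputs,
    # so the receptive field stays local and cheap (unlike full attention).
    padded = np.vstack([np.zeros((k - 1, dim)), x])
    conv = np.stack([(padded[t:t + k] * kernel).sum(0) for t in range(seq_len)])
    gate = 1.0 / (1.0 + np.exp(-(x @ gate_w)))  # sigmoid gate
    return gate * conv
```

Because the convolution window is short and causal, prefill and decode cost grow linearly in sequence length, which is the property behind the reported CPU speedups; the handful of grouped-query-attention blocks then supply the global mixing the convolutions lack.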
Deep learning models for breast cancer detection from mammographic images exhibit significant reliability problems when presented with Out-of-Distribution (OOD) inputs such as other imaging modalities (CT, MRI, X-ray) or equipment variations, leading to unreliable detection and misdiagnosis. This work mitigates the fundamental OOD issue through a comprehensive approach that integrates ResNet50-based OOD filtering with YOLO architectures (YOLOv8, YOLOv11, YOLOv12) for accurate breast cancer detection. Our strategy establishes an in-domain gallery and uses cosine similarity to strictly reject non-mammographic inputs before processing, ensuring that only domain-associated images feed the detection pipeline. The OOD detection component achieves 99.77% overall accuracy, with perfect 100% accuracy on OOD test sets, effectively eliminating irrelevant imaging modalities. ResNet50 was selected as the optimal backbone after a search across 12 CNN architectures. The joint framework unites OOD robustness with high detection performance (mAP@0.5: 0.947) and enhanced interpretability through Grad-CAM visualizations. Experimental validation establishes that OOD filtering significantly improves system reliability by preventing false alarms on out-of-distribution inputs while maintaining high detection accuracy on mammographic data. The present study offers a foundation for deploying reliable AI-based breast cancer detection systems in diverse clinical environments with inherent data heterogeneity.
https://arxiv.org/abs/2512.00129
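The cosine-similarity gallery rejection described in this abstract is simple to sketch. The embeddings and the threshold value below are illustrative; the paper derives its embeddings from a ResNet50 backbone and does not report this exact threshold:

```python
import numpy as np

def build_gallery(embeddings):
    """L2-normalise a set of in-domain (mammography) feature embeddings once,
    so cosine similarity reduces to a dot product at query time."""
    g = np.asarray(embeddings, dtype=float)
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def is_in_domain(gallery, feat, threshold=0.8):
    """Accept an input only if its embedding's best cosine similarity to the
    gallery clears `threshold`; otherwise reject it as OOD before it ever
    reaches the YOLO detection pipeline."""
    f = np.asarray(feat, dtype=float)
    f = f / np.linalg.norm(f)
    return bool((gallery @ f).max() >= threshold)
```

In deployment, `feat` would be the backbone's embedding of an incoming scan; a CT or chest X-ray lands far from every gallery entry and is filtered out, which is how the system avoids false alarms on non-mammographic inputs.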