Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields a +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86, all p < 10^{-20}, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial-frequency bands, which encode anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within ±0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at this https URL.
https://arxiv.org/abs/2604.12152
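As a concrete illustration of the screening criterion, the sketch below ranks candidate autoencoders by reconstruction PSNR on a held-out batch, with no diffusion training involved. The `encode`/`decode` interface is a generic placeholder, not the paper's actual API:

```python
# Minimal sketch: screen candidate VAEs by reconstruction PSNR alone,
# before any diffusion training. The encode/decode calls are hypothetical.
import torch

def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

@torch.no_grad()
def screen_autoencoder(vae, images: torch.Tensor) -> float:
    """Encode-decode a held-out batch and return mean reconstruction PSNR."""
    recon = vae.decode(vae.encode(images))  # hypothetical encode/decode API
    return psnr(images, recon).item()

# Usage: pick the VAE with the highest screening PSNR, e.g.
# best = max(candidate_vaes, key=lambda v: screen_autoencoder(v, val_batch))
```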
As Earth-observation workloads move toward onboard and edge processing, remote-sensing segmentation models must operate under tight latency and energy constraints. We present SatReg, a regression-based hardware-aware tuning framework for lightweight remote-sensing segmentation on edge platforms. Using CM-UNet as the teacher architecture, we reduce the search space to two dominant width-related variables, profile a small set of student models on an NVIDIA Jetson Orin Nano, and fit low-order surrogate models for mIoU, latency, and power. Knowledge distillation is used to efficiently train the sampled students. The learned surrogates enable fast selection of near-optimal architecture settings for deployment targets without exhaustive search. Results show that the selected variables affect task accuracy and hardware cost differently, making reduced-space regression a practical strategy for adapting hybrid CNN-Mamba segmentation models to future space-edge systems.
https://arxiv.org/abs/2604.10306
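To illustrate the reduced-space regression idea, here is a minimal sketch that fits a quadratic surrogate for mIoU over two width variables from a handful of profiled students; all numbers are illustrative placeholders, and the latency and power surrogates would be fit the same way:

```python
# Toy sketch of reduced-space surrogate fitting over two width variables.
# Profiling values below are made up for illustration only.
import numpy as np

# (w1, w2) -> measured mIoU from a few distilled students
W = np.array([[0.25, 0.25], [0.25, 0.75], [0.5, 0.5],
              [0.75, 0.25], [0.75, 0.75], [1.0, 1.0]])
miou = np.array([61.2, 63.0, 65.1, 64.0, 66.8, 67.5])

def design(W):
    """Quadratic design matrix: 1, w1, w2, w1*w2, w1^2, w2^2."""
    w1, w2 = W[:, 0], W[:, 1]
    return np.column_stack([np.ones_like(w1), w1, w2, w1 * w2, w1**2, w2**2])

coef, *_ = np.linalg.lstsq(design(W), miou, rcond=None)

def predict(w1, w2):
    return design(np.array([[w1, w2]])) @ coef

# Candidate settings are then scanned cheaply instead of profiled exhaustively.
print(predict(0.6, 0.9)[0])
```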
The growing complexity of visuomotor policies poses significant challenges for deployment under heterogeneous robotic hardware constraints. However, most existing efficiency-oriented approaches for robotic manipulation are device- and model-specific, lack generalizability, and require time-consuming per-device optimization during adaptation. In this work, we propose a unified framework named \textbf{D}evice-\textbf{C}onditioned \textbf{Q}uantization-\textbf{F}or-\textbf{A}ll (DC-QFA), which amortizes deployment effort through device-conditioned quantization-aware training and hardware-constrained architecture search. Specifically, we introduce a single supernet that spans a rich design space over network architectures and mixed-precision bit-widths. It is optimized with latency- and memory-aware regularization, guided by per-device lookup tables. With this supernet, for each target platform we can perform a once-for-all lightweight search to select an optimal subnet without any per-device re-optimization, which enables more generalizable deployment across heterogeneous hardware and substantially reduces deployment time. To improve long-horizon stability under low precision, we further introduce multi-step on-policy distillation to mitigate error accumulation during closed-loop execution. Extensive experiments on three representative policy backbones, namely DiffusionPolicy-T, MDT-V, and OpenVLA-OFT, demonstrate that DC-QFA achieves $2\text{-}3\times$ acceleration on edge devices, consumer-grade GPUs, and cloud platforms, with negligible drop in task success. Real-world evaluations on an Inovo robot equipped with a force/torque sensor further validate that our low-bit DC-QFA policies maintain stable, contact-rich manipulation even under severe quantization.
https://arxiv.org/abs/2604.10170
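The latency- and memory-aware regularization can be sketched as a probability-weighted lookup-table penalty, in the spirit of differentiable supernet search; the LUT entries, option sets, and budget below are illustrative, not the paper's measurements:

```python
# Sketch of latency-aware supernet regularization with a per-device lookup
# table (LUT). LUT values and option names are illustrative placeholders.
import torch

# Measured per-layer latency (ms) for each (bit-width/block) option on one device
LUT_MS = torch.tensor([
    [0.9, 1.4, 2.3],   # layer 0: options {4-bit, 8-bit, fp16}
    [1.1, 1.8, 3.0],   # layer 1
])

arch_logits = torch.zeros(2, 3, requires_grad=True)  # one row of options per layer

def expected_latency(logits: torch.Tensor) -> torch.Tensor:
    """Differentiable expected latency: sum of probability-weighted LUT entries."""
    probs = torch.softmax(logits, dim=-1)
    return (probs * LUT_MS).sum()

task_loss = torch.tensor(0.42)   # stand-in for the policy's training loss
lam, budget = 0.1, 3.0           # regularization weight, latency budget (ms)
loss = task_loss + lam * torch.relu(expected_latency(arch_logits) - budget)
loss.backward()                  # gradients reach the architecture logits via the LUT
```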
We investigate how looped transformers encode human preference in their internal iteration states. Using Ouro-2.6B-Thinking, a 2.6B-parameter looped transformer with iterative refinement, we extract hidden states from each loop iteration and train lightweight evaluator heads (~5M parameters) to predict human preference on the Anthropic HH-RLHF dataset. Our pairwise evaluator achieves 95.2% test accuracy on 8,552 unseen examples, surpassing a full-batch L-BFGS probe (84.5%) while the base model remains completely frozen. Our central finding is that loop states encode preference predominantly relationally: a linear probe on pairwise differences achieves 84.5%, the best nonlinear independent evaluator reaches only 65% test accuracy, and linear independent classification scores 21.75%, below chance and with inverted polarity. Interpreted precisely, the evaluator functions as a model-internal consistency probe, measuring how stably Ouro's own learned value system organizes its representations rather than how well it predicts noisy human annotations. We also document a systematic architecture search that established a genuine 70% ceiling for independent scoring, and show how the 50% argument-swap protocol required to prevent degenerate pairwise solutions deflated pairwise training metrics by about 31 points at peak, creating the false appearance that pairwise and pointwise evaluators shared the same ceiling. Finally, we show that a cosine learning-rate dead zone at epoch 2 accidentally acted as early stopping, preserving the generalization peak before overfitting degraded test accuracy from 95.2% to 62.4% by epoch 5. Cross-epoch flip-test analysis shows that antisymmetry correlation remains stable while strict sign-flip rate mainly tracks scorer bias. We propose the flip test as a mandatory diagnostic for pairwise preference evaluators.
https://arxiv.org/abs/2604.09870
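The flip-test diagnostic can be sketched in a few lines: score each pair in both argument orders, then check that antisymmetry correlation stays high while the strict sign-flip rate exposes scorer bias. The linear scorer and bias term below are synthetic stand-ins for the paper's evaluator head:

```python
# Sketch of the flip-test diagnostic for pairwise preference evaluators.
# A constant scorer bias leaves antisymmetry correlation untouched but
# inflates strict sign-flip failures, as the abstract describes.
import numpy as np

rng = np.random.default_rng(0)
h_a, h_b = rng.normal(size=(2000, 64)), rng.normal(size=(2000, 64))  # paired features
w, bias = rng.normal(size=64) / 8.0, 0.5  # bias models a position-dependent offset

def score(x, y):
    return (x - y) @ w + bias             # pairwise-difference probe with bias

s_ab, s_ba = score(h_a, h_b), score(h_b, h_a)
antisym_corr = np.corrcoef(s_ab, -s_ba)[0, 1]        # invariant to the bias (= 1.0 here)
flip_fail = np.mean(np.sign(s_ab) == np.sign(s_ba))  # strict flips that fail
print(f"antisymmetry corr = {antisym_corr:.3f}, flip failure rate = {flip_fail:.3f}")
```

With the biased scorer the correlation stays at 1.0 while roughly a quarter of strict flips fail, mirroring the finding that the sign-flip rate mainly tracks scorer bias.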
Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at this https URL.
https://arxiv.org/abs/2604.06938
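A minimal sketch of the sequence-level REINFORCE formulation: the policy emits an entire module sequence in one forward pass and is updated only from a terminal reward. Module names, the policy network, and the toy reward are illustrative; POS-ISP additionally predicts continuous module parameters, omitted here for brevity:

```python
# Compact sketch of sequence-level RL for modular pipeline tuning: sample the
# whole module sequence at once, receive a single terminal reward, REINFORCE.
import torch
import torch.nn as nn

MODULES = ["demosaic", "denoise", "white_balance", "gamma", "sharpen"]
T = 4  # pipeline length

policy = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, T * len(MODULES)))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def terminal_reward(seq):
    return float(len(set(seq)))  # toy reward: prefer diverse pipelines

for step in range(200):
    logits = policy(torch.ones(1, 1)).view(T, len(MODULES))
    dist = torch.distributions.Categorical(logits=logits)
    seq = dist.sample()                          # whole sequence, one forward pass
    reward = terminal_reward(seq.tolist())       # reward only at the end
    loss = -(reward * dist.log_prob(seq).sum())  # REINFORCE on the full sequence
    opt.zero_grad(); loss.backward(); opt.step()
```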
Developmental approaches to neural architecture search grow functional networks from compact genomes through self-organisation, but the resulting networks operate with fixed post-growth weights. We characterise Hebbian and anti-Hebbian plasticity across 50,000 morphogenetically grown recurrent controllers (5M+ configurations on CartPole and Acrobot), then test whether co-evolutionary experiments -- where plasticity parameters are encoded in the genome and evolved alongside the developmental architecture -- recover these patterns independently. Our characterisation reveals that (1) anti-Hebbian plasticity significantly outperforms Hebbian for competent networks (Cohen's d = 0.53-0.64), (2) regret (fraction of oracle improvement lost under the best fixed setting) reaches 52-100%, and (3) plasticity's role shifts from fine-tuning to genuine adaptation under non-stationarity. Co-evolution independently discovers these patterns: on CartPole, 70% of runs evolve anti-Hebbian plasticity (p = 0.043); on Acrobot, evolution finds near-zero eta with mixed signs -- exactly matching the characterisation. A random-RNN control shows that anti-Hebbian dominance is generic to small recurrent networks, but the degree of topology-dependence is developmental-specific: regret is 2-6x higher for morphogenetically grown networks than for random graphs with matched topology statistics.
https://arxiv.org/abs/2604.03386
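For reference, the plasticity rule being characterised is a one-line outer-product update; a negative learning rate gives the anti-Hebbian variant that the study finds superior for competent networks. Sizes and constants below are toys:

```python
# Minimal sketch of the Hebbian update dW = eta * post (outer) pre, where
# eta < 0 yields the anti-Hebbian (decorrelating) variant. Toy sizes only.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))   # recurrent weights of a grown controller

def hebbian_step(W, pre, post, eta):
    """eta > 0: Hebbian; eta < 0: anti-Hebbian."""
    return W + eta * np.outer(post, pre)

x = rng.normal(size=8)
for _ in range(100):                      # run recurrent dynamics with plasticity on
    y = np.tanh(W @ x)
    W = hebbian_step(W, x, y, eta=-0.01)  # anti-Hebbian: weaken co-active links
    x = y
```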
Spacecraft anomaly detection is critical for mission safety, yet deploying sophisticated models on-board presents significant challenges due to hardware constraints. This paper investigates three approaches for spacecraft telemetry anomaly detection -- forecasting & threshold, direct classification, and image classification -- and optimizes them for edge deployment using multi-objective neural architecture optimization on the European Space Agency Anomaly Dataset. Our baseline experiments demonstrate that forecasting & threshold achieves superior detection performance compared to alternatives, with a Corrected Event-wise F0.5-score (CEF0.5) of 92.7% [1]. Through Pareto-optimal architecture optimization, we dramatically reduced computational requirements while maintaining detection capability -- the optimized forecasting & threshold model preserved an 88.8% CEF0.5 while reducing RAM usage by 97.1% to just 59 KB and operations by 99.4%. Analysis of deployment viability shows our optimized models require just 0.36-6.25% of CubeSat RAM, making on-board anomaly detection practical even on highly constrained hardware. This research demonstrates that sophisticated anomaly detection capabilities can be successfully deployed within spacecraft edge computing constraints, providing near-instantaneous detection without exceeding hardware limitations or compromising mission safety.
https://arxiv.org/abs/2603.29375
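A toy sketch of the forecasting & threshold approach: forecast the next telemetry sample, then flag points whose residual exceeds a threshold calibrated on nominal data. A one-step persistence forecast stands in for the paper's learned forecaster:

```python
# Toy forecasting & threshold anomaly detector: residuals of a one-step
# forecast are thresholded at mean + 4*std of a nominal calibration prefix.
import numpy as np

rng = np.random.default_rng(1)
telemetry = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.normal(size=500)
telemetry[300:305] += 1.5                        # injected anomaly

pred = telemetry[:-1]                            # naive persistence forecast
resid = np.abs(telemetry[1:] - pred)
tau = resid[:250].mean() + 4 * resid[:250].std() # threshold from nominal prefix
anomalies = np.nonzero(resid > tau)[0] + 1
print(anomalies)                                 # indices around the injected event
```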
Integrating quantum circuits into deep learning pipelines remains challenging due to the limitations of heuristic circuit design. We propose Q-DIVER, a hybrid framework combining a large-scale pretrained EEG encoder (DIVER-1) with a differentiable quantum classifier. Unlike fixed-ansatz approaches, we employ Differentiable Quantum Architecture Search to autonomously discover task-optimal circuit topologies during end-to-end fine-tuning. On the PhysioNet Motor Imagery dataset, our quantum classifier achieves predictive performance comparable to classical multi-layer perceptrons (Test F1: 63.49\%) while using approximately \textbf{50$\times$ fewer task-specific head parameters} (2.10M vs. 105.02M). These results validate quantum transfer learning as a parameter-efficient strategy for high-dimensional biological signal processing.
https://arxiv.org/abs/2603.28122
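The continuous relaxation behind differentiable quantum architecture search can be sketched by mixing candidate gates through softmax-weighted architecture logits, so topology and gate angles train jointly. This toy two-qubit example assumes PennyLane with the torch interface and is not Q-DIVER's actual circuit or objective:

```python
# Sketch of the differentiable-QAS relaxation: softmax-weighted mixture over
# candidate gate choices, trained jointly with the gate angles.
import torch
import pennylane as qml

dev = qml.device("default.qubit", wires=2)

def make_qnode(gate):
    @qml.qnode(dev, interface="torch")
    def circuit(theta):
        gate(theta[0], wires=0)
        qml.CNOT(wires=[0, 1])
        gate(theta[1], wires=1)
        return qml.expval(qml.PauliZ(1))
    return circuit

qnodes = [make_qnode(g) for g in (qml.RX, qml.RY, qml.RZ)]  # candidate ops
alpha = torch.zeros(3, requires_grad=True)   # architecture logits
theta = torch.randn(2, requires_grad=True)   # gate angles
opt = torch.optim.Adam([alpha, theta], lr=0.1)

for _ in range(100):
    mix = torch.softmax(alpha, dim=0)
    out = sum(w * qn(theta) for w, qn in zip(mix, qnodes))
    loss = (out + 1.0) ** 2                  # toy target: drive <Z_1> toward -1
    opt.zero_grad(); loss.backward(); opt.step()

print("selected gate:", ["RX", "RY", "RZ"][int(torch.argmax(alpha))])
```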
Deep learning models for drug-like molecules and proteins overwhelmingly reuse transformer architectures designed for natural language, yet whether molecular sequences benefit from different designs has not been systematically tested. We deploy autonomous architecture search via an agent across three sequence types (SMILES, protein, and English text as control), running 3,106 experiments on a single GPU. For SMILES, architecture search is counterproductive: tuning learning rates and schedules alone outperforms the full search (p = 0.001). For natural language, architecture changes drive 81% of improvement (p = 0.009). Proteins fall between the two. Surprisingly, although the agent discovers distinct architectures per domain (p = 0.004), every innovation transfers across all three domains with <1% degradation, indicating that the differences reflect search-path dependence rather than fundamental biological requirements. We release a decision framework and open-source toolkit for molecular modeling teams to choose between autonomous architecture search and simple hyperparameter tuning.
https://arxiv.org/abs/2603.28015
In the context of algorithms for problem solving, procedural knowledge -- the know-how of algorithm design and operator composition -- remains implicit in code, lost between runs, and must be re-engineered for each new domain. Knowledge graphs (KGs) have proven effective for organizing declarative knowledge, yet current KG paradigms provide limited support for representing procedural knowledge as executable, learnable graph structures. We introduce \textit{Generative Executable Algorithm Knowledge Graphs} (GEAKG), a class of KGs whose nodes store executable operators, whose edges encode learned composition patterns, and whose traversal generates solutions. A GEAKG is \emph{generative} (topology and operators are synthesized by a Large Language Model), \emph{executable} (every node is runnable code), and \emph{transferable} (learned patterns generalize zero-shot across domains). The framework is domain-agnostic at the engine level: the same three-layer architecture and Ant Colony Optimization (ACO)-based learning engine can be instantiated across domains, parameterized by a pluggable ontology (\texttt{RoleSchema}). Two case studies -- sharing no domain-specific framework code -- provide concrete evidence for this framework hypothesis: (1)~Neural Architecture Search across 70 cross-dataset transfer pairs on two tabular benchmarks, and (2)~Combinatorial Optimization, where knowledge learned on the Traveling Salesman Problem transfers zero-shot to scheduling and assignment domains. Taken together, the results support that algorithmic expertise can be explicitly represented, learned, and transferred as executable knowledge graphs.
https://arxiv.org/abs/2603.27922
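A minimal sketch of the GEAKG structure under stated assumptions: nodes hold executable operators, edges hold learned composition weights, and a traversal composes a runnable pipeline. The operators, weights, and greedy traversal below are toy stand-ins; in the paper the weights are learned with ACO and the topology is synthesized by an LLM:

```python
# Toy GEAKG: executable-operator nodes, weighted composition edges, and a
# traversal whose visited path is itself a runnable pipeline.
NODES = {                                  # node -> executable operator
    "load":      lambda x: list(x),
    "normalize": lambda xs: [v / max(xs) for v in xs],
    "square":    lambda xs: [v * v for v in xs],
    "aggregate": lambda xs: sum(xs),
}
EDGES = {                                  # edge -> learned composition weight
    ("load", "normalize"): 0.9, ("load", "square"): 0.3,
    ("normalize", "square"): 0.7, ("normalize", "aggregate"): 0.4,
    ("square", "aggregate"): 0.8,
}

def traverse(start="load", end="aggregate"):
    """Greedy weighted traversal; ACO-style learning would update EDGES."""
    path, node = [start], start
    while node != end:
        node = max((e for e in EDGES if e[0] == node), key=EDGES.get)[1]
        path.append(node)
    return path

def execute(path, data):
    for name in path:
        data = NODES[name](data)
    return data

path = traverse()
print(path, "->", execute(path, [1, 2, 3, 4]))
```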
The cislunar regime departs from near-Earth orbital behavior through strongly non-linear, non-Keplerian dynamics, which adversely affect the accuracy of uncertainty propagation and state estimation. Additional challenges arise from long-range observation requirements, restrictive sensor-target geometry and illumination conditions, the need to monitor an expansive cislunar volume, and the large design space associated with space/ground-based sensor placement. In response to these challenges, this work introduces an advanced framework for cislunar space domain awareness (SDA) encompassing two key tasks: (1) observer architecture optimization based on a realistic cost formulation that captures key performance trade-offs, solved using the Tree-structured Parzen Estimator (TPE) algorithm, and (2) leveraging the resulting observer architecture, a mutual information-driven sensor tasking optimization is performed at discrete tasking intervals, while orbital and attitude state estimation is carried out at a finer temporal resolution between successive tasking updates using an error-state multiplicative unscented Kalman filter. Numerical simulations demonstrate that our approach in Task 1 yields observer architectures that achieve significantly lower values of the proposed cost function than baseline random-search solutions, while using fewer sensors. Task 2 results show that translational state estimation remains satisfactory over a wide range of target-to-observer count ratios, whereas attitude estimation is significantly more sensitive to target-to-observer ratios and tasking intervals, with increased rotational-state divergence observed for high target counts and infrequent tasking updates. These results highlight important trade-offs between sensing resources, tasking cadence, and achievable state estimation performance that influence the scalability of autonomous cislunar SDA.
https://arxiv.org/abs/2603.20579
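Task 1's optimization pattern can be sketched with an off-the-shelf TPE sampler; Optuna stands in for the paper's implementation, and the cost function below is an illustrative toy, not the paper's cost model:

```python
# Sketch of observer-architecture optimization with a Tree-structured Parzen
# Estimator. The variables and cost terms are illustrative placeholders.
import optuna

def cost(trial: optuna.Trial) -> float:
    n_sensors = trial.suggest_int("n_sensors", 2, 12)
    aperture_m = trial.suggest_float("aperture_m", 0.1, 1.0)
    altitude_frac = trial.suggest_float("altitude_frac", 0.0, 1.0)
    coverage = 1.0 - (0.85 ** n_sensors) * (1.0 - 0.3 * aperture_m)
    price = 0.05 * n_sensors + 0.4 * aperture_m ** 2 + 0.1 * altitude_frac
    return price - coverage           # minimize price, reward coverage

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(cost, n_trials=200)
print(study.best_params, study.best_value)
```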
We introduce layered Quantum Architecture Search (layered-QAS), a strategy inspired by classical network morphism that designs Parametrised Quantum Circuit (PQC) architectures by progressively growing and adapting them. PQCs offer strong expressiveness with relatively few parameters, yet they lack standard architectural layers (e.g., convolution, attention) that encode inductive biases for a given learning task. To assess the effectiveness of our method, we focus on 3D point cloud classification as a challenging yet highly structured problem. Whereas prior work on this task has used PQCs only as feature extractors for classical classifiers, our approach uses the PQC as the main building block of the classification model. Simulations show that our layered-QAS mitigates barren plateaus, outperforms quantum-adapted local and evolutionary QAS baselines, and achieves state-of-the-art results among PQC-based methods on the ModelNet dataset.
https://arxiv.org/abs/2603.20024
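The growth strategy can be sketched as a keep-if-better loop over candidate layer extensions, in the spirit of network morphism; `evaluate` is a toy stand-in for training and validating the grown PQC:

```python
# Toy sketch of progressive, morphism-style circuit growth: try candidate
# layer extensions, keep the best only if it improves, otherwise stop.
import random

CANDIDATE_LAYERS = ["rx_ring", "ry_ring", "cz_ladder", "crx_chain"]

def evaluate(layers):
    random.seed(hash(tuple(layers)) % (2**31))  # deterministic within a run
    return len(layers) * 0.1 + random.random() * 0.2  # toy validation score

arch, best = [], float("-inf")
while len(arch) < 8:
    trials = {c: evaluate(arch + [c]) for c in CANDIDATE_LAYERS}
    layer, score = max(trials.items(), key=lambda kv: kv[1])
    if score <= best:                           # no extension helps: stop growing
        break
    arch, best = arch + [layer], score
print(arch, round(best, 3))
```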
Modern computer vision requires balancing predictive accuracy with real-time efficiency, yet the high inference cost of large vision models (LVMs) limits deployment on resource-constrained edge devices. Although Evolutionary Neural Architecture Search (ENAS) is well suited for multi-objective optimization, its practical use is hindered by two issues: expensive candidate evaluation and ranking inconsistency among subnetworks. To address them, we propose EvoNAS, an efficient distributed framework for multi-objective evolutionary architecture search. We build a hybrid supernet that integrates Vision State Space and Vision Transformer (VSS-ViT) modules, and optimize it with a Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) strategy. By coupling the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules, CA-DDKD improves the representational capacity of the shared supernet and enhances ranking consistency, enabling reliable fitness estimation during evolution without extra fine-tuning. To reduce the cost of large-scale validation, we further introduce a Distributed Multi-Model Parallel Evaluation (DMMPE) framework based on GPU resource pooling and asynchronous scheduling. Compared with conventional data-parallel evaluation, DMMPE improves efficiency by over 70% through concurrent multi-GPU, multi-model execution. Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that the searched architectures, termed EvoNets, consistently achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks such as novel view synthesis. Code is available at this https URL.
https://arxiv.org/abs/2603.19563
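The DMMPE idea of GPU resource pooling with asynchronous scheduling can be sketched with a shared queue of device ids and a thread pool; `evaluate_on_gpu` is a placeholder for real subnet validation:

```python
# Sketch of pooled, asynchronous multi-model evaluation: workers borrow a
# GPU id from a shared queue, validate one candidate, then return the GPU.
import queue
import time
import random
from concurrent.futures import ThreadPoolExecutor

gpu_pool = queue.Queue()
for gpu_id in range(4):                        # four GPUs in the resource pool
    gpu_pool.put(gpu_id)

def evaluate_on_gpu(candidate_id: int) -> tuple:
    gpu_id = gpu_pool.get()                    # block until a GPU frees up
    try:
        time.sleep(random.uniform(0.1, 0.3))   # stand-in for subnet validation
        return candidate_id, random.random()   # stand-in fitness score
    finally:
        gpu_pool.put(gpu_id)                   # return the GPU to the pool

with ThreadPoolExecutor(max_workers=8) as pool:  # oversubscribe the queue
    results = dict(pool.map(evaluate_on_gpu, range(32)))
print(max(results, key=results.get), max(results.values()))
```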
Neural Radiance Fields (NeRF) have emerged as a powerful approach for photorealistic 3D reconstruction from multi-view images. However, deploying NeRF for satellite imagery remains challenging. Each scene requires individual training, and optimizing architectures via Neural Architecture Search (NAS) demands hours to days of GPU time. While existing approaches focus on architectural improvements, our SHAP analysis reveals that multi-view consistency, rather than model architecture, determines reconstruction quality. Based on this insight, we develop PreSCAN, a predictive framework that estimates NeRF quality prior to training using lightweight geometric and photometric descriptors. PreSCAN selects suitable architectures in < 30 seconds with < 1 dB prediction error, achieving a 1000$\times$ speedup over NAS. We further demonstrate PreSCAN's deployment utility on edge platforms (Jetson Orin), where combining its predictions with offline cost profiling reduces inference power by 26% and latency by 43% with minimal quality loss. Experiments on the DFC2019 dataset confirm that PreSCAN generalizes across diverse satellite scenes without retraining.
https://arxiv.org/abs/2603.18306
Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like ExecuTorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto-frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x and 1.6x faster prefill and decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.
https://arxiv.org/abs/2603.15954
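Attention skipping can be sketched as a block whose attention sublayer is conditionally bypassed, leaving only the feed-forward path; the block internals and skip set below are illustrative, not MobileLLM-Flash's architecture:

```python
# Sketch of attention skipping: selected layers drop the self-attention
# sublayer and keep only the MLP path, cutting long-context cost.
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    def __init__(self, d: int, n_heads: int, skip_attention: bool):
        super().__init__()
        self.skip_attention = skip_attention
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.skip_attention:          # full block: attention + MLP
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))     # skipped block: MLP path only

SKIP = {1, 3}                                # chosen by the architecture search
blocks = nn.ModuleList(SkippableBlock(64, 4, i in SKIP) for i in range(6))
x = torch.randn(2, 128, 64)
for blk in blocks:
    x = blk(x)
```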
Applying machine learning to sensitive time-series data is often bottlenecked by the iteration loop: performance depends strongly on preprocessing and architecture, yet training often has to run on-premise under strict data-local constraints. This is a common problem in healthcare and other privacy-constrained domains (e.g., a hospital developing deep learning models on patient EEG). This bottleneck is particularly challenging in multimodal fusion, where sensor modalities must be individually preprocessed and then combined. LLM-guided neural architecture search (NAS) can automate this exploration, but most existing workflows assume cloud execution or access to data-derived artifacts that cannot be exposed. We present a novel data-local, LLM-guided search framework that proposes candidate pipelines remotely while executing all training and evaluation locally under a fixed protocol. The controller observes only trial-level summaries, such as pipeline descriptors, metrics, learning-curve statistics, and failure logs, without ever accessing raw samples or intermediate feature representations. Our framework targets multiclass, multimodal learning via one-vs-rest binary experts per class and modality, a lightweight fusion MLP, and joint search over expert architectures and modality-specific preprocessing. We evaluate our method in two regimes: UEA30 (a public multivariate time-series classification benchmark) and SleepEDFx sleep staging (heterogeneous clinical modalities such as EEG, EOG, and EMG). The results show that the modular baseline model is strong and that the LLM-guided NAS further improves it. Notably, our method finds models that perform within published ranges across most benchmark datasets. Across both settings, our method reduces manual intervention by enabling unattended architecture search while keeping sensitive data on-premise.
https://arxiv.org/abs/2603.15939
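The privacy boundary can be made concrete with a sketch of what the remote controller actually sees: a trial-level summary like the JSON below (field names illustrative), never raw samples or intermediate features:

```python
# Sketch of a trial-level summary: the only artifact that leaves the site.
# Field names and values are illustrative placeholders.
import json

trial_summary = {
    "pipeline": {"eeg": ["bandpass_0.5_40", "epoch_30s"], "fusion": "mlp_2x128"},
    "expert_arch": {"eeg": "cnn_small", "eog": "gru_64"},
    "metrics": {"val_f1_macro": 0.741, "val_acc": 0.78},
    "learning_curve": {"epochs": 40, "best_epoch": 28, "final_train_loss": 0.31},
    "failure_log": None,   # or a truncated stack trace when a trial crashes
}
payload = json.dumps(trial_summary)  # sent to the remote LLM controller
```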
When LLM agents autonomously design ML experiments, do they perform genuine architecture search -- or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that \textbf{architectural choices explain 94\% of performance variance} ($F = 1324$, $\eta^2 = 0.94$), while hyperparameter variation within a fixed architecture explains only 6\%. Cross-task validation on a second collision dataset confirms this finding (75\% architecture-explained variance) with a \emph{different} winning backbone, confirming genuine architecture discovery. The agents' key contribution is discovering that V-JEPA\,2 video features with Zipformer temporal encoders achieve 0.9245 AP -- a configuration no human proposed -- and concentrating search on productive architectural regions: at $N = 50$, LLM-guided search reaches AP $= 0.985$ versus $0.965$ for from-scratch random search. Post-bugfix convergence follows a power law ($c = 0.11$, $R^2 = 0.93$); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi-agent search dynamics via entropy cycles and Jensen--Shannon specialization, providing the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.
https://arxiv.org/abs/2603.15916
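The variance attribution can be reproduced in miniature: group experiment scores by architecture cell and compute eta-squared as the ratio of between-group to total sum of squares. Data below are synthetic:

```python
# Sketch of the eta-squared decomposition: eta^2 = SS_between / SS_total
# over scores grouped by architecture. Synthetic data stand in for the
# 10,469 logged experiments.
import numpy as np

rng = np.random.default_rng(0)
# scores[i] = AP values for experiments sharing architecture cell i
scores = [rng.normal(mu, 0.01, size=50) for mu in (0.80, 0.88, 0.92, 0.95)]

all_vals = np.concatenate(scores)
grand = all_vals.mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in scores)
ss_total = ((all_vals - grand) ** 2).sum()
print(f"eta^2 = {ss_between / ss_total:.3f}")  # near 1.0 when architecture dominates
```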
Enabling efficient deep neural network (DNN) inference on edge devices with different hardware constraints is a challenging task that typically requires DNN architectures to be specialized for each device separately. To avoid the huge manual effort, one can use neural architecture search (NAS). However, many existing NAS methods are resource-intensive and time-consuming because they require the training of many different DNNs from scratch. Furthermore, they do not take the resource constraints of the target system into account. To address these shortcomings, we propose PrototypeNAS, a zero-shot NAS method to accelerate and automate the selection, compression, and specialization of DNNs for different target microcontroller units (MCUs). We propose a novel three-step search method that decouples DNN design and specialization from DNN training for a given target platform. First, we present a novel search space that does not merely cut smaller DNNs out of a single large architecture but instead combines the structural optimization of multiple architecture types with the optimization of their pruning and quantization configurations. Second, we explore the use of an ensemble of zero-shot proxies during optimization instead of a single one. Third, we propose the use of hypervolume subset selection to distill, from the Pareto front of the multi-objective optimization, the DNN architectures that represent the most meaningful tradeoffs between accuracy and FLOPs. We evaluate the effectiveness of PrototypeNAS on 12 different datasets in three different tasks: image classification, time series classification, and object detection. Our results demonstrate that PrototypeNAS is able to identify DNN models within minutes that are small enough to be deployed on off-the-shelf MCUs and still achieve accuracies comparable to those of large DNN models.
https://arxiv.org/abs/2603.15106
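Hypervolume subset selection over a two-objective front (maximize accuracy, minimize FLOPs) can be sketched as a greedy loop that keeps the k points whose joint dominated area is largest; the front and reference point below are toys:

```python
# Toy sketch of greedy hypervolume subset selection on a 2-D Pareto front
# of (FLOPs, accuracy) points: minimize FLOPs, maximize accuracy.
def hypervolume(points, ref):
    """2-D hypervolume vs. reference (flops_ref, acc_ref); assumes the
    points are mutually non-dominated."""
    pts = sorted(points)                            # flops and acc both ascending
    hv, prev_acc = 0.0, ref[1]
    for flops, acc in pts:
        hv += (ref[0] - flops) * (acc - prev_acc)   # one accuracy slab per point
        prev_acc = acc
    return hv

def greedy_subset(front, k, ref):
    chosen = []
    for _ in range(k):
        best = max((p for p in front if p not in chosen),
                   key=lambda p: hypervolume(chosen + [p], ref))
        chosen.append(best)
    return chosen

front = [(1.0, 0.70), (2.0, 0.78), (3.5, 0.83), (5.0, 0.86), (8.0, 0.88)]
print(greedy_subset(front, k=3, ref=(10.0, 0.60)))  # most meaningful tradeoffs
```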
Neural Architecture Search (NAS) automates network design, but conventional methods demand substantial computational resources. We propose a closed-loop pipeline leveraging large language models (LLMs) to iteratively generate, evaluate, and refine convolutional neural network architectures for image classification on a single consumer-grade GPU without LLM fine-tuning. Central to our approach is a historical feedback memory inspired by Markov chains: a sliding window of $K{=}5$ recent improvement attempts keeps context size constant while providing sufficient signal for iterative learning. Unlike prior LLM optimizers that discard failure trajectories, each history entry is a structured diagnostic triple -- recording the identified problem, suggested modification, and resulting outcome -- treating code execution failures as first-class learning signals. A dual-LLM specialization reduces per-call cognitive load: a Code Generator produces executable PyTorch architectures while a Prompt Improver handles diagnostic reasoning. Since both the LLM and architecture training share limited VRAM, the search implicitly favors compact, hardware-efficient models suited to edge deployment. We evaluate three frozen instruction-tuned LLMs (${\leq}7$B parameters) across up to 2000 iterations in an unconstrained open code space, using one-epoch proxy accuracy on CIFAR-10, CIFAR-100, and ImageNette as a fast ranking signal. On CIFAR-10, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%. A full 2000-iteration search completes in ${\approx}18$ GPU hours on a single RTX~4090, establishing a low-budget, reproducible, and hardware-aware paradigm for LLM-driven NAS without cloud infrastructure.
https://arxiv.org/abs/2603.12091
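The history memory is essentially a fixed-width deque of structured diagnostic triples; here is a sketch with illustrative entries, where the K = 5 window keeps prompt context constant while failures stay visible:

```python
# Sketch of the sliding-window feedback memory: a deque of the K = 5 most
# recent (problem, modification, outcome) triples. Entries are illustrative.
from collections import deque

history = deque(maxlen=5)

history.append({
    "problem": "CUDA OOM at batch_size=256 with 4 residual stages",
    "modification": "halve base width and use stride-2 stem",
    "outcome": "ran OK, proxy acc 54.1% (prev 50.0%)",
})
history.append({
    "problem": "val accuracy plateaued after stem change",
    "modification": "add squeeze-excitation to last stage",
    "outcome": "RuntimeError: shape mismatch in SE reduction",  # failure kept as signal
})

prompt_context = "\n".join(
    f"[{i}] problem: {h['problem']} | change: {h['modification']} | result: {h['outcome']}"
    for i, h in enumerate(history)
)
```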
Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor's training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor's R^2 from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor's discriminative power for top-performing detection architectures.
https://arxiv.org/abs/2603.09405
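The surrogate plus self-evolving loop can be sketched as fit, nominate, evaluate, refit; lightgbm is assumed installed, and the toy `true_ap` stands in for COCO-mini training and evaluation:

```python
# Sketch of the self-evolving surrogate loop: fit LightGBM on encoded
# architectures, let it nominate promising candidates, evaluate, refit.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

def true_ap(x):                       # toy stand-in for COCO-mini training + eval
    return 0.3 + 0.2 * x[:, 0] - 0.1 * (x[:, 1] - 0.5) ** 2 \
        + 0.02 * rng.normal(size=len(x))

X = rng.uniform(size=(1000, 6))       # encoded (width, depth, operator) choices
y = true_ap(X)

for round_ in range(5):               # self-evolving mechanism
    model = lgb.LGBMRegressor(n_estimators=300).fit(X, y)
    pool = rng.uniform(size=(5000, 6))
    top = pool[np.argsort(model.predict(pool))[-100:]]  # surrogate-nominated elite
    X, y = np.vstack([X, top]), np.concatenate([y, true_ap(top)])

print(len(X), "architectures in the training pool")
```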