How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
https://arxiv.org/abs/2601.16175
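The "prioritize the most promising solutions" objective can be sketched as a toy best-first search loop in Python. This is a minimal sketch, not the paper's method: `propose` stands in for sampling from the continually updated policy, and the quadratic `reward` is a hypothetical continuous-reward problem invented for illustration.

```python
import random

def ttt_discover(propose, reward, init, steps=300, pool_size=4, seed=0):
    """Toy best-first search loop: keep a small pool of the highest-reward
    solutions and always expand from the current best, matching the goal of
    one great solution rather than many good ones on average."""
    rng = random.Random(seed)
    pool = [(reward(init), init)]
    for _ in range(steps):
        best = max(pool)[1]            # exploit the most promising solution
        cand = propose(best, rng)      # stand-in for sampling from the updated policy
        pool.append((reward(cand), cand))
        pool = sorted(pool, reverse=True)[:pool_size]  # keep only the top candidates
    return max(pool)

# Hypothetical continuous-reward problem: maximize -(x - 3)^2 starting at x = 0.
score, x = ttt_discover(
    propose=lambda x, rng: x + rng.gauss(0, 0.5),
    reward=lambda x: -(x - 3.0) ** 2,
    init=0.0,
)
```

The greedy, top-k pool is the key design choice: average-case quality of the pool is irrelevant, only the single best candidate matters.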
We introduce Neural Particle Automata (NPA), a Lagrangian generalization of Neural Cellular Automata (NCA) from static lattices to dynamic particle systems. Unlike classical Eulerian NCA where cells are pinned to pixels or voxels, NPA model each cell as a particle with a continuous position and internal state, both updated by a shared, learnable neural rule. This particle-based formulation yields clear individuation of cells, allows heterogeneous dynamics, and concentrates computation only on regions where activity is present. At the same time, particle systems pose challenges: neighborhoods are dynamic, and a naive implementation of local interactions scales quadratically with the number of particles. We address these challenges by replacing grid-based neighborhood perception with differentiable Smoothed Particle Hydrodynamics (SPH) operators backed by memory-efficient, CUDA-accelerated kernels, enabling scalable end-to-end training. Across tasks including morphogenesis, point-cloud classification, and particle-based texture synthesis, we show that NPA retain key NCA behaviors such as robustness and self-regeneration, while enabling new behaviors specific to particle systems. Together, these results position NPA as a compact neural model for learning self-organizing particle dynamics.
https://arxiv.org/abs/2601.16096
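A minimal sketch of SPH-style neighborhood perception, written as the naive O(N²) Gaussian-kernel version that the paper replaces with memory-efficient CUDA kernels. The particle positions, states, and kernel bandwidth below are made up for illustration.

```python
import numpy as np

def sph_perceive(pos, state, h=1.0):
    """Naive O(N^2) SPH-style perception: each particle aggregates its
    neighbors' states weighted by a Gaussian kernel of pairwise distance."""
    diff = pos[:, None, :] - pos[None, :, :]   # (N, N, D) pairwise offsets
    r2 = (diff ** 2).sum(-1)                   # squared pairwise distances
    w = np.exp(-r2 / h ** 2)                   # smoothing kernel
    np.fill_diagonal(w, 0.0)                   # a particle is not its own neighbor
    w /= np.clip(w.sum(1, keepdims=True), 1e-12, None)
    return w @ state                           # kernel-weighted neighbor average

# Two nearby particles and one distant particle, each with a 1-D internal state.
pos = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
state = np.array([[1.0], [3.0], [10.0]])
perceived = sph_perceive(pos, state)
```

Because the kernel decays with distance, each of the two nearby particles perceives essentially only the other, while the distant particle contributes negligibly to them.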
Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. Thus, they cannot produce multiple segmentation proposals, support human interaction, or adapt across modalities. Recently, text-to-image diffusion models have shown potential to bridge the gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation purposes. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, to steer a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting the target organ. Our experiment on organ segmentation from CT images demonstrates strong performance compared to previous methods and could greatly benefit from an expert-in-the-loop setting to leverage multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.
https://arxiv.org/abs/2601.16060
Planetary surfaces are typically analyzed using high-level semantic concepts in natural language, yet vast orbital image archives remain organized at the pixel level. This mismatch limits scalable, open-ended exploration of planetary surfaces. Here we present MarScope, a planetary-scale vision-language framework enabling natural language-driven, label-free mapping of Martian landforms. MarScope aligns planetary images and text in a shared semantic space, trained on over 200,000 curated image-text pairs. This framework transforms global geomorphic mapping on Mars by replacing pre-defined classifications with flexible semantic retrieval, enabling arbitrary user queries across the entire planet in 5 seconds with F1 scores up to 0.978. Applications further show that it extends beyond morphological classification to facilitate process-oriented analysis and similarity-based geomorphological mapping at a planetary scale. MarScope establishes a new paradigm where natural language serves as a direct interface for scientific discovery over massive geospatial datasets.
https://arxiv.org/abs/2601.15949
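The retrieval interface reduces to nearest-neighbor search by cosine similarity in the shared image-text space. A sketch under assumptions: the 3-D vectors below are hypothetical stand-ins for the encoders' embeddings, and `retrieve` is an illustrative name, not MarScope's API.

```python
import numpy as np

def retrieve(query_vec, image_vecs, k=2):
    """Label-free semantic retrieval sketch: rank planetary images by cosine
    similarity to the embedding of a natural-language query."""
    q = query_vec / np.linalg.norm(query_vec)
    im = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = im @ q                       # cosine similarity via unit vectors
    order = np.argsort(-sims)[:k]       # indices of the k best matches
    return order, sims[order]

# Hypothetical 3-D embeddings standing in for encoder outputs.
images = np.array([[1.0, 0.0, 0.1],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])       # e.g. an embedded text query
idx, scores = retrieve(query, images)
```

Because queries are resolved by similarity rather than by a fixed label set, arbitrary new queries need no retraining, which is what replaces pre-defined classifications in the framework.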
Model merging (MM) offers an efficient mechanism for integrating multiple specialized models without access to original training data or costly retraining. While MM has demonstrated success in domains like computer vision, its role in recommender systems (RSs) remains largely unexplored. Recently, Generative Recommendation (GR) has emerged as a new paradigm in RSs, characterized by rapidly growing model scales and substantial computational costs, making MM particularly appealing for cost-sensitive deployment scenarios. In this work, we present the first systematic study of MM in GR through a contextual lens. We focus on a fundamental yet underexplored challenge in real-world deployments: how to merge generative recommenders specialized to different real-world contexts, arising from temporally evolving user behaviors and heterogeneous application domains. To this end, we propose a unified framework MMGRid, a structured contextual grid of GR checkpoints that organizes models trained under diverse contexts induced by temporal evolution and domain diversity. All checkpoints are derived from a shared base LLM but fine-tuned on context-specific data, forming a realistic and controlled model space for systematically analyzing MM across GR paradigms and merging algorithms. Our investigation reveals several key insights. First, training GR models from LLMs can introduce parameter conflicts during merging due to token distribution shifts and objective disparities; such conflicts can be alleviated by disentangling task-aware and context-specific parameter changes via base model replacement. Second, incremental training across contexts induces recency bias, which can be effectively balanced through weighted contextual merging. Notably, we observe that optimal merging weights correlate with context-dependent interaction characteristics, offering practical guidance for weight selection in real-world deployments.
https://arxiv.org/abs/2601.15930
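The base-model-replacement and weighted-contextual-merging ideas can be sketched with task vectors: subtract the shared base so only context-specific deltas are combined, then add the weighted sum back. The parameter vectors and weights below are illustrative values, not the paper's.

```python
import numpy as np

def merge_contextual(base, checkpoints, weights):
    """Weighted contextual merging sketch: combine per-context parameter
    deltas (checkpoint minus shared base) so task-aware base parameters are
    not double-counted, then restore the base."""
    deltas = [ckpt - base for ckpt in checkpoints]        # context-specific changes
    merged_delta = sum(w * d for w, d in zip(weights, deltas))
    return base + merged_delta

base = np.zeros(4)                                        # shared base parameters
ckpt_recent = np.array([1.0, 0.0, 0.0, 0.0])              # fine-tuned on recent data
ckpt_older = np.array([0.0, 1.0, 0.0, 0.0])               # fine-tuned on older data
# Down-weight the most recent context to counter recency bias.
merged = merge_contextual(base, [ckpt_recent, ckpt_older], weights=[0.4, 0.6])
```

The weights are exactly the lever the paper analyzes: tilting them away from the most recent checkpoint is one way to balance the recency bias induced by incremental training.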
Neuron segmentation is the cornerstone of reconstructing comprehensive neuronal connectomes, which is essential for deciphering the functional organization of the brain. The irregular morphology and densely intertwined structures of neurons make this task particularly challenging. Prevailing CNN-based methods often fail to resolve ambiguous boundaries due to the lack of long-range context, whereas Transformer-based methods suffer from boundary imprecision caused by the loss of voxel-level details during patch partitioning. To address these limitations, we propose NeuroMamba, a multi-perspective framework that exploits the linear complexity of Mamba to enable patch-free global modeling and synergizes this with complementary local feature modeling, thereby efficiently capturing long-range dependencies while meticulously preserving fine-grained voxel details. Specifically, we design a channel-gated Boundary Discriminative Feature Extractor (BDFE) to enhance local morphological cues. Complementing this, we introduce the Spatial Continuous Feature Extractor (SCFE), which integrates a resolution-aware scanning mechanism into the Visual Mamba architecture to adaptively model global dependencies across varying data resolutions. Finally, a cross-modulation mechanism synergistically fuses these multi-perspective features. Our method demonstrates state-of-the-art performance across four public EM datasets, validating its exceptional adaptability to both anisotropic and isotropic resolutions. The source code will be made publicly available.
https://arxiv.org/abs/2601.15929
Contrast medium plays a pivotal role in radiological imaging, as it amplifies lesion conspicuity and improves detection for the diagnosis of tumor-related diseases. However, depending on the patient's health condition or the medical resources available, the use of contrast medium is not always feasible. Recent work has explored AI-based image translation to synthesize contrast-enhanced images directly from non-contrast scans, aiming to reduce side effects and streamline clinical workflows. Progress in this direction has been constrained by data limitations: (1) existing public datasets focus almost exclusively on brain-related paired MR modalities; (2) other collections include partially paired data but suffer from missing modalities/timestamps and imperfect spatial alignment; (3) explicit labeling of CT vs. CTC or DCE phases is often absent; (4) substantial resources remain private. To bridge this gap, we introduce the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. The MR data include complete dynamic contrast-enhanced (DCE) sequences covering all three phases (DCE1-DCE3), while the CT data provide paired non-contrast and contrast-enhanced acquisitions (CTC). The dataset is curated for anatomical correspondence, enabling rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs). Built upon this resource, we establish a comprehensive benchmark. We report results from representative baselines of contemporary image-to-image translation. We release the dataset and benchmark to catalyze research on safe, effective contrast synthesis, with direct relevance to multi-organ oncology imaging workflows. Our code and dataset are publicly available at this https URL.
https://arxiv.org/abs/2601.15884
With the growing demand for device-free and privacy-preserving sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient compression framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector quantization-based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize compression, we employ the K-means algorithm to dynamically adjust compression bitrates by clustering a large-scale pre-trained codebook into smaller subsets. Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art compression schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same compression rate. It also reduces latency and networking overhead by up to 5x and 2.5x, respectively. The code repository is available online here.
https://arxiv.org/abs/2601.15838
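The bitrate-adaptation step can be sketched as K-means over a toy codebook: clustering a large codebook into k centroids cuts the per-token cost from log2(N) to log2(k) bits. `shrink_codebook`, the deterministic initialization, and the toy codebook are assumptions for this sketch, not TinySense's implementation.

```python
import numpy as np

def shrink_codebook(codebook, k, iters=20):
    """K-means clusters a large pre-trained codebook into k centroids, so a
    CSI token costs log2(k) bits instead of log2(len(codebook)).
    Deterministic spread initialization is an assumption for this sketch."""
    centers = codebook[np.linspace(0, len(codebook) - 1, k).astype(int)].astype(float)
    for _ in range(iters):
        # Assign every codeword to its nearest centroid, then recompute means.
        d = ((codebook[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = codebook[assign == j].mean(0)
    return centers

# Toy 16-entry codebook with two obvious clusters, near 0 and near 10.
rng = np.random.default_rng(1)
codebook = np.vstack([rng.normal(0, 0.1, (8, 2)), rng.normal(10, 0.1, (8, 2))])
small = shrink_codebook(codebook, k=2)
bits_per_token = np.log2(len(small))   # 1 bit per token instead of 4
```

Choosing k at runtime is what lets the sender trade reconstruction fidelity against bitrate as network conditions change.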
Deep image steganography (DIS) has achieved significant results in capacity and invisibility. However, current paradigms enforce the secret image to maintain the same resolution as the cover image during hiding and revealing. This leads to two challenges: secret images with inconsistent resolutions must undergo resampling beforehand which results in detail loss during recovery, and the secret image cannot be recovered to its original resolution when the resolution value is unknown. To address these, we propose ARDIS, the first Arbitrary Resolution DIS framework, which shifts the paradigm from discrete mapping to reference-guided continuous signal reconstruction. Specifically, to minimize the detail loss caused by resolution mismatch, we first design a Frequency Decoupling Architecture in hiding stage. It disentangles the secret into a resolution-aligned global basis and a resolution-agnostic high-frequency latent to hide in a fixed-resolution cover. Second, for recovery, we propose a Latent-Guided Implicit Reconstructor to perform deterministic restoration. The recovered detail latent code modulates a continuous implicit function to accurately query and render high-frequency residuals onto the recovered global basis, ensuring faithful restoration of original details. Furthermore, to achieve blind recovery, we introduce an Implicit Resolution Coding strategy. By transforming discrete resolution values into dense feature maps and hiding them in the redundant space of the feature domain, the reconstructor can correctly decode the secret's resolution directly from the steganographic representation. Experimental results demonstrate that ARDIS significantly outperforms state-of-the-art methods in both invisibility and cross-resolution recovery fidelity.
https://arxiv.org/abs/2601.15739
Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving a highly competitive accuracy of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5% of the parameters of state-of-the-art diffusion models. Code is available at: this https URL.
https://arxiv.org/abs/2601.15681
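Channel-wise feature interpolation can be sketched as a per-channel convex blend of two samples' features. The exact mixing rule in Cr-GAN may differ, so treat this as an illustrative assumption; shapes and inputs are invented.

```python
import numpy as np

def channel_interpolate(feat_a, feat_b, seed=0):
    """Sketch of channel-wise feature interpolation: draw one mixing
    coefficient per channel and blend two samples' features channel by
    channel, creating a novel latent feature from scarce real data."""
    lam = np.random.default_rng(seed).random(feat_a.shape[0])  # one lambda per channel
    return lam[:, None] * feat_a + (1.0 - lam[:, None]) * feat_b

a = np.ones((4, 3))    # hypothetical 4-channel features from one SAR sample
b = np.zeros((4, 3))   # features from another sample
mixed = channel_interpolate(a, b)
```

Mixing at the channel level, rather than interpolating whole feature maps, multiplies the number of distinct latent combinations available from a handful of real samples.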
Realistic network traffic simulation is critical for evaluating intrusion detection systems, stress-testing network protocols, and constructing high-fidelity environments for cybersecurity training. While attack traffic can often be layered into training environments using red-teaming or replay methods, generating authentic benign background traffic remains a core challenge -- particularly in simulating the complex temporal and communication dynamics of real-world networks. This paper introduces TempoNet, a novel generative model that combines multi-task learning with multi-mark temporal point processes to jointly model inter-arrival times and all packet- and flow-header fields. TempoNet captures fine-grained timing patterns and higher-order correlations such as host-pair behavior and seasonal trends, addressing key limitations of GAN-, LLM-, and Bayesian-based methods that fail to reproduce structured temporal variation. TempoNet produces temporally consistent, high-fidelity traces, validated on real-world datasets. Furthermore, we show that intrusion detection models trained on TempoNet-generated background traffic perform comparably to those trained on real data, validating its utility for real-world security applications.
https://arxiv.org/abs/2601.15663
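A marked temporal point process can be sketched by drawing exponential inter-arrival times and attaching a mark to each event. This toy version re-evaluates the rate only at event times (so it is not exact for a time-varying intensity, unlike a proper thinning sampler), and the rate function and marks are made up in place of TempoNet's learned intensity and packet-header fields.

```python
import random

def sample_traffic(rate_fn, marks, horizon, seed=0):
    """Toy marked temporal point process: draw exponential inter-arrival
    times whose rate depends on the current time, and attach a categorical
    mark (stand-in for packet/flow-header fields) to each event."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate_fn(t))    # inter-arrival from current intensity
        if t > horizon:
            break
        events.append((t, rng.choice(marks)))
    return events

# Diurnal-style rate: the network is busier in the first half of the window.
events = sample_traffic(lambda t: 10.0 if t < 50 else 2.0,
                        marks=["dns", "http", "tls"], horizon=100.0)
early = sum(1 for t, _ in events if t < 50)
late = len(events) - early
```

Even this crude sampler reproduces a structured temporal variation (a busy period followed by a quiet one), which is the property the paper argues GAN- and LLM-based generators fail to capture.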
Healthcare organizations are beginning to embed agentic AI into routine workflows, including clinical documentation support and early-warning monitoring. As these capabilities diffuse across departments and vendors, health systems face agent sprawl, causing duplicated agents, unclear accountability, inconsistent controls, and tool permissions that persist beyond the original use case. Existing AI governance frameworks emphasize lifecycle risk management but provide limited guidance for the day-to-day operations of agent fleets. We propose a Unified Agent Lifecycle Management (UALM) blueprint derived from a rapid, practice-oriented synthesis of governance standards, agent security literature, and healthcare compliance requirements. UALM maps recurring gaps onto five control-plane layers: (1) an identity and persona registry, (2) orchestration and cross-domain mediation, (3) PHI-bounded context and memory, (4) runtime policy enforcement with kill-switch triggers, and (5) lifecycle management and decommissioning linked to credential revocation and audit logging. A companion maturity model supports staged adoption. UALM offers healthcare CIOs, CISOs, and clinical leaders an implementable pattern for audit-ready oversight that preserves local innovation and enables safer scaling across clinical and administrative domains.
https://arxiv.org/abs/2601.15630
Cross-subject EEG-based emotion recognition (EER) remains challenging due to strong inter-subject variability, which induces substantial distribution shifts in EEG signals, as well as the high complexity of emotion-related neural representations in both spatial organization and temporal evolution. Existing approaches typically improve spatial modeling, temporal modeling, or generalization strategies in isolation, which limits their ability to align representations across subjects while capturing multi-scale dynamics and suppressing subject-specific bias within a unified framework. To address these gaps, we propose a Region-aware Spatiotemporal Modeling framework with Collaborative Domain Generalization (RSM-CoDG) for cross-subject EEG emotion recognition. RSM-CoDG incorporates neuroscience priors derived from functional brain region partitioning to construct region-level spatial representations, thereby improving cross-subject comparability. It also employs multi-scale temporal modeling to characterize the dynamic evolution of emotion-evoked neural activity. In addition, the framework employs a collaborative domain generalization strategy, incorporating multidimensional constraints to reduce subject-specific bias in a fully unseen target subject setting, which enhances the generalization to unknown individuals. Extensive experimental results on SEED series datasets demonstrate that RSM-CoDG consistently outperforms existing competing methods, providing an effective approach for improving robustness. The source code is available at this https URL.
https://arxiv.org/abs/2601.15615
Current business environments require organizations to continuously reconfigure cross-functional processes, yet enterprise systems are still organized around siloed departments, rigid workflows, and hard-coded automation. Meanwhile large language models (LLMs) excel at interpreting natural language and unstructured data but lack deterministic, verifiable execution of complex business logic. To address this gap, here we introduce AUTOBUS, an Autonomous Business System that integrates LLM-based AI agents, predicate-logic programming, and business-semantics-centric enterprise data into a coherent neuro-symbolic AI architecture for orchestrating end-to-end business initiatives. AUTOBUS models an initiative as a network of tasks with explicit pre/post conditions, required data, evaluation rules, and API-level actions. Enterprise data is organized as a knowledge graph whose entities, relationships, and constraints are translated into logic facts and foundational rules, providing the semantic grounding for task reasoning. Core AI agents synthesize task instructions, enterprise semantics, and available tools into task-specific logic programs, which are executed by a logic engine that enforces constraints, coordinates auxiliary tools, and orchestrates the execution of actions and outcomes. Humans define and maintain the semantics, policies, and task instructions, curate tools, and supervise high-impact or ambiguous decisions, ensuring accountability and adaptability. We detail the AUTOBUS architecture, the anatomy of the AI agent generated logic programs, and the role of humans and auxiliary tools in the lifecycle of a business initiative.
https://arxiv.org/abs/2601.15599
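The pre/post-condition contract on tasks can be sketched over a set of string-encoded logic facts. The task schema, the `run_task` name, and the invoice example are hypothetical illustrations, not AUTOBUS's actual representation or engine.

```python
def run_task(task, facts):
    """Sketch of deterministic, verifiable task execution: refuse to act
    unless the preconditions hold over the fact set, and verify the promised
    postconditions after the action runs."""
    missing = task["pre"] - facts
    if missing:
        raise ValueError(f"preconditions not met: {missing}")
    facts = task["action"](set(facts))       # API-level action over a copy of the facts
    if not task["post"] <= facts:
        raise ValueError("postconditions violated")
    return facts

# Hypothetical "issue invoice" task over string-encoded logic facts.
invoice_task = {
    "pre": {"order_approved", "customer_verified"},
    "post": {"invoice_issued"},
    "action": lambda f: f | {"invoice_issued"},
}
facts = run_task(invoice_task, {"order_approved", "customer_verified"})
```

The point of the symbolic layer is visible even in this sketch: an LLM agent may author the task, but whether it runs, and whether it did what it promised, is checked deterministically.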
Bias in chest X-ray classifiers frequently stems from sex- and age-related shortcuts, leading to systematic underdiagnosis of minority subgroups. Previous pixel-space attribute neutralizers, which rely on convolutional encoders, lessen but do not fully remove this attribute leakage at clinically usable edit strengths. This study evaluates whether substituting the U-Net convolutional encoder with a Vision Transformer backbone in the Attribute-Neutral Framework can reduce demographic attribute leakage while preserving diagnostic accuracy. A data-efficient Image Transformer Small (DeiT-S) neutralizer was trained on the ChestX-ray14 dataset. Its edited images, generated across eleven edit-intensity levels, were evaluated with an independent AI judge for attribute leakage and with a convolutional neural network (ConvNet) for disease prediction. At a moderate edit level (alpha = 0.5), the Vision Transformer (ViT) neutralizer reduces patient sex-recognition area under the curve (AUC) to approximately 0.80, about 10 percentage points below the original framework's convolutional U-Net encoder, despite being trained for only half as many epochs. Meanwhile, macro receiver operating characteristic area under the curve (ROC AUC) across 15 findings stays within five percentage points of the unedited baseline, and the worst-case subgroup AUC remains near 0.70. These results indicate that global self-attention vision models can further suppress attribute leakage without sacrificing clinical utility, suggesting a practical route toward fairer chest X-ray AI.
https://arxiv.org/abs/2601.15490
Biomedical research increasingly relies on integrating diverse data modalities, including gene expression profiles, medical images, and clinical metadata. While medical images and clinical metadata are routinely collected in clinical practice, gene expression data presents unique challenges for widespread research use, mainly due to stringent privacy regulations and costly laboratory experiments. To address these limitations, we present GeMM-GAN, a novel Generative Adversarial Network conditioned on histopathology tissue slides and clinical metadata, designed to synthesize realistic gene expression profiles. GeMM-GAN combines a Transformer Encoder for image patches with a final Cross Attention mechanism between patches and text tokens, producing a conditioning vector to guide a generative model in generating biologically coherent gene expression profiles. We evaluate our approach on the TCGA dataset and demonstrate that our framework outperforms standard generative models and generates more realistic and functionally meaningful gene expression profiles, improving accuracy on downstream disease type prediction by more than 11\% compared to current state-of-the-art generative models. Code will be available at: this https URL
https://arxiv.org/abs/2601.15392
The rapid expansion of research across machine learning, vision, and language has produced a volume of publications that is increasingly difficult to synthesize. Traditional bibliometric tools rely mainly on metadata and offer limited visibility into the semantic content of papers, making it hard to track how research themes evolve over time or how different areas influence one another. To obtain a clearer picture of recent developments, we compile a unified corpus of more than 100,000 papers from 22 major conferences between 2020 and 2025 and construct a multidimensional profiling pipeline to organize and analyze their textual content. By combining topic clustering, LLM-assisted parsing, and structured retrieval, we derive a comprehensive representation of research activity that supports the study of topic lifecycles, methodological transitions, dataset and model usage patterns, and institutional research directions. Our analysis highlights several notable shifts, including the growth of safety, multimodal reasoning, and agent-oriented studies, as well as the gradual stabilization of areas such as neural machine translation and graph-based methods. These findings provide an evidence-based view of how AI research is evolving and offer a resource for understanding broader trends and identifying emerging directions. Code and dataset: this https URL
https://arxiv.org/abs/2601.15170
Critical domain knowledge typically resides with a few experts, creating organizational bottlenecks in scalability and decision-making. Non-experts struggle to create effective visualizations, leading to suboptimal insights and diverting expert time. This paper investigates how to capture and embed human domain knowledge into AI agent systems through an industrial case study. We propose a software engineering framework for capturing human domain knowledge when engineering AI agents for simulation data visualization: it augments a Large Language Model (LLM) with a request classifier, a Retrieval-Augmented Generation (RAG) system for code generation, codified expert rules, and visualization design principles, unified in an agent demonstrating autonomous, reactive, proactive, and social behavior. Evaluation across five scenarios spanning multiple engineering domains with 12 evaluators demonstrates a 206% improvement in output quality, with our agent achieving expert-level ratings in all cases versus the baseline's poor performance, while maintaining superior code quality with lower variance. Our contributions are: an automated agent-based system for visualization generation, and a validated framework for systematically capturing human domain knowledge and codifying tacit expert knowledge into AI agents, demonstrating that non-experts can achieve expert-level outcomes in specialized domains.
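The front end of such an agent, a request classifier routing to a retrieval step over codified expert rules, can be sketched as follows. All names and rules here are hypothetical illustrations, not the paper's actual components:

```python
# Hypothetical store of codified expert visualization rules.
DOC_STORE = {
    "stress": "Use a contour plot with a diverging colormap for stress fields.",
    "flow":   "Use streamlines colored by velocity magnitude for flow fields.",
}

def classify_request(text):
    """Route a user request to a visualization category by keyword match."""
    t = text.lower()
    for key in DOC_STORE:
        if key in t:
            return key
    return "general"

def retrieve_guideline(category):
    """RAG-style lookup: fetch the codified expert rule for the category."""
    return DOC_STORE.get(category, "Fall back to a labeled 2D line plot.")

req = "Plot the stress distribution over the bracket"
rule = retrieve_guideline(classify_request(req))
```

In the actual framework, the retrieved rule would be injected into the LLM prompt that generates the visualization code; this stub only shows the classify-then-retrieve routing.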
https://arxiv.org/abs/2601.15153
Polycystic Ovary Syndrome (PCOS) is the most common endocrine disorder in women of reproductive age, and many Bangladeshi women suffer from it later in life. The aim of our research is to identify effective vision-based medical image analysis techniques and to evaluate hybrid models for the accurate detection of PCOS. We introduce two novel hybrid models combining convolutional and transformer-based approaches. The training and testing data were organized into two categories: "infected" (PCOS-positive) and "noninfected" (healthy ovaries). In the initial stage, our first hybrid model, 'DenConST' (integrating DenseNet121, Swin Transformer, and ConvNeXt), achieved 85.69% accuracy. The final optimized model, 'DenConREST' (incorporating Swin Transformer, ConvNeXt, DenseNet121, ResNet18, and EfficientNetV2), demonstrated superior performance with 98.23% accuracy, the best among all evaluated models. This research highlights an efficient solution for detecting PCOS from ultrasound images, significantly improving diagnostic accuracy while reducing detection errors.
https://arxiv.org/abs/2601.15119
Bangla music is rich in its own musical culture. Nowadays, music genre classification is highly significant because of the exponential increase in available music, in both digital and physical formats. Indexing this music by genre is necessary to facilitate improved retrieval. Automatically classifying Bangla music by genre is essential for efficiently locating specific pieces within a vast and diverse music library. Prevailing methods for genre classification predominantly employ conventional machine learning or deep learning approaches. This work introduces a novel music dataset comprising ten distinct genres of Bangla music. For the task of audio classification, we utilize a recurrent neural network (RNN) architecture. Specifically, a Long Short-Term Memory (LSTM) network is implemented to train the model and perform the classification. Feature extraction represents a foundational stage in audio data processing. This study utilizes Mel-Frequency Cepstral Coefficients (MFCCs) to transform raw audio waveforms into a compact and representative set of features. The proposed framework performs music genre classification by leveraging these extracted features. Experimental results demonstrate a classification accuracy of 78%, indicating the system's strong potential to enhance and streamline the organization of Bangla music genres.
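The last step of MFCC extraction, converting log mel-filterbank energies into cepstral coefficients via a type-II discrete cosine transform, can be sketched with the standard library. A real pipeline (e.g. librosa) also performs framing, windowing, an FFT, and mel filtering before this step; the toy energies below are assumptions for illustration:

```python
import math

def dct_ii(x):
    """Type-II DCT: decorrelates log mel-filterbank energies into
    cepstral coefficients (MFCCs)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n)) for k in range(n)]

# Toy log mel energies for one audio frame (real pipelines use ~26-40 bands).
log_mel = [math.log(e) for e in [4.0, 3.0, 2.5, 2.0, 1.5, 1.0]]

# Keep the first few coefficients as the frame's feature vector.
mfcc = dct_ii(log_mel)[:4]
```

Stacking such per-frame vectors over time yields the sequence that the LSTM consumes for genre classification.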
https://arxiv.org/abs/2601.15083