Accurate delineation of acute ischemic stroke lesions in MRI is a key component of stroke diagnosis and management. In recent years, deep learning models have been successfully applied to the automatic segmentation of such lesions. While most proposed architectures are based on the U-Net framework, they primarily differ in their choice of loss functions and in the use of deep supervision, residual connections, and attention mechanisms. Moreover, many implementations are not publicly available, and the optimal configuration for acute ischemic stroke (AIS) lesion segmentation remains unclear. In this work, we introduce ISLA (Ischemic Stroke Lesion Analyzer), a new deep learning model for AIS lesion segmentation from diffusion MRI, trained on three multicenter databases totaling more than 1500 AIS participants. Through systematic optimization of the loss function, convolutional architecture, deep supervision, and attention mechanisms, we developed a robust segmentation framework. We further investigated unsupervised domain adaptation to improve generalization to an external clinical dataset. ISLA outperformed two state-of-the-art approaches for AIS lesion segmentation on an external test set. Code and trained models will be made publicly available to facilitate reuse and reproducibility.
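As a concrete illustration of the loss-function dimension of such an optimization, the sketch below implements a compound soft-Dice plus binary cross-entropy objective, a common candidate in segmentation loss searches (hypothetical here: the abstract does not state ISLA's final loss choice):

```python
import numpy as np

def dice_ce_loss(probs, target, eps=1e-6):
    """Compound soft-Dice + binary cross-entropy loss, one common candidate
    in segmentation loss-function searches (illustrative only: not stated
    to be ISLA's final configuration)."""
    probs, target = probs.ravel(), target.ravel()
    inter = (probs * target).sum()
    # Soft Dice term: 0 when prediction matches the mask, up to ~1 otherwise.
    dice = 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)
    # Per-voxel binary cross-entropy, clipped by eps for numerical stability.
    ce = -np.mean(target * np.log(probs + eps)
                  + (1.0 - target) * np.log(1.0 - probs + eps))
    return dice + ce
```

Combining an overlap term with a per-voxel term is popular for lesion segmentation because the Dice part counteracts the strong foreground/background imbalance that plain cross-entropy suffers from.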
https://arxiv.org/abs/2601.08732
Accurate and generalisable segmentation of stroke lesions from magnetic resonance imaging (MRI) is essential for advancing clinical research, prognostic modelling, and personalised interventions. Although deep learning has improved automated lesion delineation, many existing models are optimised for narrow imaging contexts and generalise poorly to independent datasets, modalities, and stroke stages. Here, we systematically evaluated stroke lesion segmentation using the nnU-Net framework across multiple heterogeneous, publicly available MRI datasets spanning acute and chronic stroke. Models were trained and tested on diffusion-weighted imaging (DWI), fluid-attenuated inversion recovery (FLAIR), and T1-weighted MRI, and evaluated on independent datasets. Across stroke stages, models showed robust generalisation, with segmentation accuracy approaching reported inter-rater reliability. Performance varied with imaging modality and training data characteristics. In acute stroke, DWI-trained models consistently outperformed FLAIR-based models, with only modest gains from multimodal combinations. In chronic stroke, increasing training set size improved performance, with diminishing returns beyond several hundred cases. Lesion volume was a key determinant of accuracy: smaller lesions were harder to segment, and models trained on restricted volume ranges generalised poorly. MRI image quality further constrained generalisability: models trained on lower-quality scans transferred poorly, whereas those trained on higher-quality data generalised well to noisier images. Discrepancies between predictions and reference masks were often attributable to limitations in manual annotations. Together, these findings show that automated lesion segmentation can approach human-level performance while identifying key factors governing generalisability and informing the development of lesion segmentation tools.
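The segmentation accuracy and inter-rater reliability discussed above are conventionally quantified with the Dice similarity coefficient; a minimal sketch of that standard metric (not code from the study) is:

```python
import numpy as np

def dice_coefficient(pred, ref):
    """Dice similarity coefficient between two binary lesion masks: twice
    the overlap divided by the total mask volume, in [0, 1]."""
    pred = np.asarray(pred).astype(bool)
    ref = np.asarray(ref).astype(bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:  # both masks empty: treated here as perfect agreement
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom
```

Because the denominator is the summed mask sizes, Dice penalizes both false positives and false negatives, which is why it is the usual yardstick against manual annotations.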
https://arxiv.org/abs/2601.08701
For automated assessment of knee MRI scans, both accuracy and interpretability are essential for clinical use and adoption. Traditional radiomics rely on predefined features chosen at the population level; while more interpretable, they are often too restrictive to capture patient-specific variability and can underperform end-to-end deep learning (DL). To address this, we propose two complementary strategies that bring individuality and interpretability: radiomic fingerprints and healthy personas. First, a radiomic fingerprint is a dynamically constructed, patient-specific feature set derived from MRI. Instead of applying a uniform population-level signature, our model predicts feature relevance from a pool of candidate features and selects only those most predictive for each patient, while maintaining feature-level interpretability. This fingerprint can be viewed as a latent-variable model of feature usage, where an image-conditioned predictor estimates usage probabilities and a transparent logistic regression with global coefficients performs classification. Second, a healthy persona synthesises a pathology-free baseline for each patient using a diffusion model trained to reconstruct healthy knee MRIs. Comparing features extracted from pathological images against their personas highlights deviations from normal anatomy, enabling intuitive, case-specific explanations of disease manifestations. We systematically compare fingerprints, personas, and their combination across three clinical tasks. Experimental results show that both approaches yield performance comparable to or surpassing state-of-the-art DL models, while supporting interpretability at multiple levels. Case studies further illustrate how these perspectives facilitate human-explainable biomarker discovery and pathology localisation.
https://arxiv.org/abs/2601.08604
Channel configuration search, the optimization of layer specifications such as layer widths in deep neural networks, presents a complex combinatorial challenge constrained by tensor shape compatibility and computational budgets. We posit that Large Language Models (LLMs) offer a transformative approach to Neural Architecture Search (NAS), capable of reasoning about architectural code structure in ways that traditional heuristics cannot. In this paper, we investigate the application of an LLM-driven NAS framework to the problem of channel configuration. We formulate the search as a sequence of conditional code generation tasks, where an LLM refines architectural specifications based on performance telemetry. Crucially, we address the data scarcity problem by generating a vast corpus of valid, shape-consistent architectures via Abstract Syntax Tree (AST) mutations. While these mutated networks are not necessarily high-performing, they provide the critical volume of structural data required for the LLM to learn the latent relationship between channel configurations and model performance. This allows the LLM to internalize complex design patterns and apply them to optimize feature extraction strategies. Experimental results on CIFAR-100 validate the efficacy of this approach, demonstrating that the model yields statistically significant improvements in accuracy. Our analysis confirms that the LLM successfully acquires domain-specific architectural priors, distinguishing this method from random search and highlighting the immense potential of language-driven design in deep learning.
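The shape-consistency constraint on mutations can be illustrated with a simplified stand-in (the paper mutates architectural code via ASTs; this sketch mutates a shared width list instead, which preserves compatibility for the same reason):

```python
import random

def mutate_channels(widths, choices=(16, 32, 64, 128), seed=0):
    """Mutate one layer width in a sequential conv stack. Because layer i's
    output channels are layer i+1's input channels, editing a single entry
    of the shared width list keeps every tensor shape compatible -- a
    simplified stand-in for the paper's AST-level code mutations."""
    rng = random.Random(seed)
    idx = rng.randrange(len(widths))
    mutated = list(widths)
    # Always pick a width different from the current one.
    mutated[idx] = rng.choice([c for c in choices if c != widths[idx]])
    return mutated
```

Sampling many such mutations yields the kind of valid-but-not-necessarily-performant architecture corpus the abstract describes.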
https://arxiv.org/abs/2601.08517
Digital platforms have an ever-expanding user base, and act as a hub for communication, business, and connectivity. However, this has also allowed for the spread of hate speech and misogyny. Artificial intelligence models have emerged as an effective solution for countering online hate speech but remain underexplored for low-resource and code-mixed languages and suffer from a lack of interpretability. Explainable Artificial Intelligence (XAI) can enhance transparency in the decisions of deep learning models, which is crucial for a sensitive domain such as hate speech detection. In this paper, we present a multi-modal and explainable web application for detecting misogyny in text and memes in code-mixed Hindi and English. The system leverages state-of-the-art transformer-based models that support multilingual and multimodal settings. For text-based misogyny identification, the system utilizes XLM-RoBERTa (XLM-R) and multilingual Bidirectional Encoder Representations from Transformers (mBERT) on a dataset of approximately 4,193 comments. For multimodal misogyny identification from memes, the system utilizes mBERT + EfficientNet and mBERT + ResNet trained on a dataset of approximately 4,218 memes. It also provides feature importance scores using explainability techniques including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME). The application aims to serve as a tool for both researchers and content moderators, to promote further research in the field, combat gender-based digital violence, and ensure a safe digital space. The system has been evaluated using human evaluators who provided their responses on the Chatbot Usability Questionnaire (CUQ) and User Experience Questionnaire (UEQ) to determine overall usability.
https://arxiv.org/abs/2601.08457
In the era of digital animation, the quest to produce lifelike facial animations for virtual characters has led to the development of various retargeting methods. While retargeting facial motion between models of similar shapes has been very successful, challenges arise when the retargeting is performed on stylized or exaggerated 3D characters that deviate significantly from human facial structures. In this scenario, it is important to consider the target character's facial structure and possible range of motion to preserve the semantics assumed by the original facial motions after the retargeting. To achieve this, we propose a local patch-based retargeting method that transfers facial animations captured in a source performance video to a target stylized 3D character. Our method consists of three modules. The Automatic Patch Extraction Module extracts local patches from the source video frame. These patches are processed through the Reenactment Module to generate correspondingly re-enacted target local patches. The Weight Estimation Module calculates the animation parameters for the target character at every frame to create a complete facial animation sequence. Extensive experiments demonstrate that our method can successfully transfer the semantic meaning of source facial expressions to stylized characters with considerable variations in facial feature proportion.
https://arxiv.org/abs/2601.08429
Objectives: To overcome challenges in diagnosing pericoronitis on panoramic radiographs, we developed an AI-assisted assessment system integrating anatomical localization, pathological classification, and interpretability. Methods: A two-stage deep learning pipeline was implemented. The first stage used YOLOv8 to detect third molars and classify their anatomical positions and angulations based on Winter's classification. Detected regions were then fed into a second-stage classifier, a modified ResNet-50 architecture, for detecting radiographic features suggestive of pericoronitis. To enhance clinical trust, Grad-CAM was used to highlight key diagnostic regions on the radiographs. Results: The YOLOv8 component achieved 92% precision and 92.5% mean average precision. The ResNet-50 classifier yielded F1-scores of 88% for normal cases and 86% for pericoronitis. Radiologists reported 84% alignment between Grad-CAM and their diagnostic impressions, supporting the radiographic relevance of the interpretability output. Conclusion: The system shows strong potential for AI-assisted panoramic assessment, with explainable AI features that support clinical confidence.
https://arxiv.org/abs/2601.08401
Reliably counting and generating sequences of items remain a significant challenge for neural networks, including Large Language Models (LLMs). Indeed, although this capability is readily handled by rule-based symbolic systems based on serial computation, learning to systematically deploy counting procedures is difficult for neural models, which must acquire these skills through learning. Previous research has demonstrated that recurrent architectures can only approximately track and enumerate sequences of events, and it remains unclear whether modern deep learning systems, including LLMs, can deploy systematic counting procedures over sequences of discrete symbols. This paper aims to fill this gap by investigating the sequential enumeration abilities of five state-of-the-art LLMs, including proprietary, open-source, and reasoning models. We probe LLMs in sequential naming and production tasks involving lists of letters and words, adopting a variety of prompting instructions to explore the role of chain-of-thought in the spontaneous emergence of counting strategies. We also evaluate open-source models with the same architecture but increasing size to test whether mastery of counting principles follows scaling laws, and we analyze the embedding dynamics during sequential enumeration to investigate the emergent encoding of numerosity. We find that some LLMs are indeed capable of deploying counting procedures when explicitly prompted to do so, but none of them spontaneously engage in counting when simply asked to enumerate the number of items in a sequence. Our results suggest that, despite their impressive emergent abilities, LLMs cannot yet robustly and systematically deploy counting procedures, highlighting a persistent gap between neural and symbolic approaches to compositional generalization.
https://arxiv.org/abs/2512.04727
Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, which is delayed generalization following overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into signal contributions from clean and noisy training data, the epoch-wise evolutions of internal signals were analyzed separately. Three main findings were obtained from this analysis. First, the model achieved strong re-generalization on test data even after perfectly fitting noisy training data during the double descent phase, corresponding to a "benign overfitting" state. Second, noisy data were learned after clean data, and as learning progressed, their corresponding internal activations became increasingly separated in outer layers; this enabled the model to overfit only noisy data. Third, a single, very large activation emerged in the shallow layer across all models; this phenomenon is referred to as "outliers," "massive activations," or "super activations" in recent large language models and evolves with re-generalization. The magnitude of the large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of "deep double descent," "benign overfitting," and "large activation," and support the proposal of a novel scenario for understanding deep double descent.
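The loss-curve decomposition described above can be sketched directly: given per-sample losses and a mask marking which training labels were corrupted, the mean loss splits into clean and noisy contributions (a minimal illustration, not the study's code):

```python
import numpy as np

def decomposed_ce(probs, is_noisy):
    """Split mean cross-entropy into clean and noisy-label contributions,
    mirroring an epoch-wise loss-curve decomposition. `probs` holds each
    sample's predicted probability for its (possibly corrupted) training
    label; `is_noisy` marks samples whose labels were flipped."""
    ce = -np.log(np.clip(probs, 1e-12, 1.0))
    return ce.mean(), ce[~is_noisy].mean(), ce[is_noisy].mean()
```

Tracking the two components separately over epochs is what reveals that noisy samples are fit later than clean ones.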
https://arxiv.org/abs/2601.08316
A lack of standardized datasets has long hindered progress in automatic intrapulse modulation classification (AIMC) - a critical task in radar signal analysis for electronic support systems, particularly under noisy or degraded conditions. AIMC seeks to identify the modulation type embedded within a single radar pulse from its complex in-phase and quadrature (I/Q) representation, enabling automated interpretation of intrapulse structure. This paper introduces AIMC-Spec, a comprehensive synthetic dataset for spectrogram-based image classification, encompassing 33 modulation types across 13 signal-to-noise ratio (SNR) levels. To benchmark AIMC-Spec, five representative deep learning algorithms - ranging from lightweight CNNs and denoising architectures to transformer-based networks - were re-implemented and evaluated under a unified input format. The results reveal significant performance variation, with frequency-modulated (FM) signals classified more reliably than phase or hybrid types, particularly at low SNRs. A focused FM-only test further highlights how modulation type and network architecture influence classifier robustness. AIMC-Spec establishes a reproducible baseline and provides a foundation for future research and standardization in the AIMC domain.
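The spectrogram representation central to AIMC-Spec can be sketched as a windowed short-time Fourier transform of the complex I/Q pulse; the window length, hop, and chirp parameters below are illustrative assumptions, not the dataset's actual settings:

```python
import numpy as np

def spectrogram(iq, n_fft=64, hop=16):
    """Magnitude spectrogram of a complex I/Q pulse via a windowed STFT --
    the image representation used in spectrogram-based classification.
    Window length and hop here are illustrative, not the dataset's values."""
    win = np.hanning(n_fft)
    frames = [iq[s:s + n_fft] * win
              for s in range(0, len(iq) - n_fft + 1, hop)]
    return np.abs(np.fft.fft(frames, axis=1)).T  # (freq bin, time frame)

# Hypothetical example: a linear FM (chirp) pulse in complex white noise.
n = 1024
t = np.arange(n)
rng = np.random.default_rng(0)
pulse = np.exp(1j * np.pi * 0.25 * t**2 / n)  # quadratic phase = linear FM
noisy = pulse + 0.1 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
```

In such an image, the FM pulse appears as a tilted line of energy, which hints at why FM types are reported to be classified more reliably than phase-coded ones.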
https://arxiv.org/abs/2601.08265
Low-cost inertial navigation systems (INS) are prone to sensor biases and measurement noise, which lead to rapid degradation of navigation accuracy during global positioning system (GPS) outages. To address this challenge and improve positioning continuity in GPS-denied environments, this paper proposes a brain-inspired GPS/INS fusion network (BGFN) based on spiking neural networks (SNNs). The BGFN architecture integrates a spiking Transformer with a spiking encoder to simultaneously extract spatial features from inertial measurement unit (IMU) signals and capture their temporal dynamics. By modeling the relationship between vehicle attitude, specific force, angular rate, and GPS-derived position increments, the network leverages both current and historical IMU data to estimate vehicle motion. The effectiveness of the proposed method is evaluated through real-world field tests and experiments on public datasets. Compared to conventional deep learning approaches, the results demonstrate that BGFN achieves higher accuracy and enhanced reliability in navigation performance, particularly under prolonged GPS outages.
https://arxiv.org/abs/2601.08244
Diabetic retinopathy (DR), affecting millions globally with projections indicating a significant rise, poses a severe blindness risk and strains healthcare systems. Diagnostic complexity arises from visual symptom overlap with conditions like age-related macular degeneration and hypertensive retinopathy, exacerbated by high misdiagnosis rates in underserved regions. This study introduces TIMM-ProRS, a novel deep learning framework integrating Vision Transformer (ViT), Convolutional Neural Network (CNN), and Graph Neural Network (GNN) with multi-modal fusion. TIMM-ProRS uniquely leverages both retinal images and temporal biomarkers (HbA1c, retinal thickness) to capture multi-modal and temporal dynamics. Trained on APTOS 2019 and comprehensively validated on Messidor-2, RFMiD, EyePACS, and Messidor-1, the model achieves 97.8% accuracy and an F1-score of 0.96, demonstrating state-of-the-art performance and outperforming existing methods like RSG-Net and DeepDR. This approach enables early, precise, and interpretable diagnosis, supporting scalable telemedical management and enhancing global eye health sustainability.
https://arxiv.org/abs/2601.08240
Ruminal acidosis is a prevalent metabolic disorder in dairy cattle causing significant economic losses and animal welfare concerns. Current diagnostic methods rely on invasive pH measurement, limiting scalability for continuous monitoring. We present FUME (Fused Unified Multi-gas Emission Network), the first deep learning approach for rumen acidosis detection from dual-gas optical imaging under in vitro conditions. Our method leverages complementary carbon dioxide (CO2) and methane (CH4) emission patterns captured by infrared cameras to classify rumen health into Healthy, Transitional, and Acidotic states. FUME employs a lightweight dual-stream architecture with weight-shared encoders, modality-specific self-attention, and channel attention fusion, jointly optimizing gas plume segmentation and classification of dairy cattle health. We introduce the first dual-gas OGI dataset comprising 8,967 annotated frames across six pH levels with pixel-level segmentation masks. Experiments demonstrate that FUME achieves 80.99% mIoU and 98.82% classification accuracy while using only 1.28M parameters and 1.97G MACs, outperforming state-of-the-art methods in segmentation quality with 10x lower computational cost. Ablation studies reveal that CO2 provides the primary discriminative signal and dual-task learning is essential for optimal performance. Our work establishes the feasibility of gas emission-based livestock health monitoring, paving the way for practical, in vitro acidosis detection systems. Code is available at this https URL.
https://arxiv.org/abs/2601.08205
Aggregating multi-site brain MRI data can enhance deep learning model training, but also introduces non-biological heterogeneity caused by site-specific variations (e.g., differences in scanner vendors, acquisition parameters, and imaging protocols) that can undermine generalizability. Recent retrospective MRI harmonization seeks to reduce such site effects by standardizing image style (e.g., intensity, contrast, noise patterns) while preserving anatomical content. However, existing methods often rely on limited paired traveling-subject data or fail to effectively disentangle style from anatomy. Furthermore, most current approaches address only single-sequence harmonization, restricting their use in real-world settings where multi-sequence MRI is routinely acquired. To this end, we introduce MMH, a unified framework for multi-site multi-sequence brain MRI harmonization that leverages biomedical semantic priors for sequence-aware style alignment. MMH operates in two stages: (1) a diffusion-based global harmonizer that maps MR images to a sequence-specific unified domain using style-agnostic gradient conditioning, and (2) a target-specific fine-tuner that adapts globally aligned images to desired target domains. A tri-planar attention BiomedCLIP encoder aggregates multi-view embeddings to characterize volumetric style information, allowing explicit disentanglement of image styles from anatomy without requiring paired data. Evaluations on 4,163 T1- and T2-weighted MRIs demonstrate MMH's superior harmonization over state-of-the-art methods in image feature clustering, voxel-level comparison, tissue segmentation, and downstream age and site classification.
https://arxiv.org/abs/2601.08193
Geometric Representation Learning (GRL) aims to approximate the non-Euclidean topology of high-dimensional data through discrete graph structures, grounded in the manifold hypothesis. However, traditional static graph construction methods based on Euclidean distance often fail to capture the intrinsic curvature characteristics of the data manifold. Although Ollivier-Ricci Curvature Flow (OCF) has proven to be a powerful tool for dynamic topological optimization, its core reliance on Optimal Transport (Wasserstein distance) leads to prohibitive computational complexity, severely limiting its application in large-scale datasets and deep learning frameworks. To break this bottleneck, this paper proposes a novel geometric evolution framework: Resistance Curvature Flow (RCF). Leveraging the concept of effective resistance from circuit physics, RCF transforms expensive curvature optimization into efficient matrix operations. This approach achieves over 100x computational acceleration while maintaining geometric optimization capabilities comparable to OCF. We provide an in-depth exploration of the theoretical foundations and dynamical principles of RCF, elucidating how it guides the redistribution of edge weights via curvature gradients to eliminate topological noise and strengthen local cluster structures. Furthermore, we provide a mechanistic explanation of RCF's role in manifold enhancement and noise suppression, as well as its compatibility with deep learning models. We design a graph optimization algorithm, DGSL-RCF, based on this framework. Experimental results across deep metric learning, manifold learning, and graph structure learning demonstrate that DGSL-RCF significantly improves representation quality and downstream task performance.
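The efficiency claim rests on effective resistance being computable with plain linear algebra; a minimal sketch of that quantity (the RCF update rule itself is not reproduced here) is:

```python
import numpy as np

def effective_resistance(W):
    """All-pairs effective resistance of a weighted graph, computed from the
    Moore-Penrose pseudoinverse of the graph Laplacian. This is the cheap
    matrix-operation quantity that RCF uses in place of optimal-transport
    (Wasserstein) distances."""
    L = np.diag(W.sum(axis=1)) - W       # weighted graph Laplacian
    Lp = np.linalg.pinv(L)               # pseudoinverse (L is singular)
    d = np.diag(Lp)
    # R_ij = Lp_ii + Lp_jj - Lp_ij - Lp_ji (= -2 Lp_ij for symmetric Lp)
    return d[:, None] + d[None, :] - Lp - Lp.T
```

For a unit-weight triangle, each pair is connected by a direct edge (resistance 1) in parallel with a two-edge path (resistance 2), giving 2/3, which the test below checks.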
https://arxiv.org/abs/2601.08149
Deep learning-based automatic medical image segmentation plays a critical role in clinical diagnosis and treatment planning but remains challenging in few-shot scenarios due to the scarcity of annotated training data. Recently, self-supervised foundation models such as DINOv3, which were trained on large natural image datasets, have shown strong potential for dense feature extraction that can help with the few-shot learning challenge. Yet, their direct application to medical images is hindered by domain differences. In this work, we propose DINO-AugSeg, a novel framework that leverages DINOv3 features to address the few-shot medical image segmentation challenge. Specifically, we introduce WT-Aug, a wavelet-based feature-level augmentation module that enriches the diversity of DINOv3-extracted features by perturbing frequency components, and CG-Fuse, a contextual information-guided fusion module that exploits cross-attention to integrate semantic-rich low-resolution features with spatially detailed high-resolution features. Extensive experiments on six public benchmarks spanning five imaging modalities, including MRI, CT, ultrasound, endoscopy, and dermoscopy, demonstrate that DINO-AugSeg consistently outperforms existing methods under limited-sample conditions. The results highlight the effectiveness of incorporating wavelet-domain augmentation and contextual fusion for robust feature representation, suggesting DINO-AugSeg as a promising direction for advancing few-shot medical image segmentation. Code and data will be made available on this https URL.
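The idea of perturbing frequency components of a feature map can be sketched with a one-level Haar transform; this is a hypothetical simplification in the spirit of WT-Aug, since the abstract does not specify the module's actual wavelet or perturbation:

```python
import numpy as np

def haar_freq_perturb(feat, gamma=0.5):
    """Feature-level augmentation sketch in the spirit of WT-Aug
    (hypothetical: the actual module is not specified in the abstract).
    A one-level Haar transform along the last axis splits features into
    low-/high-frequency parts, the high-frequency part is rescaled by
    `gamma`, and the features are reconstructed."""
    a, b = feat[..., 0::2], feat[..., 1::2]
    low, high = (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)
    high = gamma * high                       # perturb frequency content
    out = np.empty_like(feat)
    out[..., 0::2] = (low + high) / np.sqrt(2)  # inverse Haar step
    out[..., 1::2] = (low - high) / np.sqrt(2)
    return out
```

With `gamma=1` the transform is an exact identity, so the perturbation strength interpolates smoothly between the original features and their low-frequency skeleton.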
https://arxiv.org/abs/2601.08078
Accurately forecasting long-term atmospheric variables remains a defining challenge in meteorological science due to the chaotic nature of atmospheric systems. Temperature data represents a complex superposition of deterministic cyclical climate forces and stochastic, short-term fluctuations. While planetary mechanics drive predictable seasonal periodicities, rapid meteorological changes such as thermal variations, pressure anomalies, and humidity shifts introduce nonlinear volatilities that defy simple extrapolation. Historically, the Seasonal Autoregressive Integrated Moving Average (SARIMA) model has been the standard for modeling historical weather data, prized for capturing linear seasonal trends. However, SARIMA operates under strict assumptions of stationarity, failing to capture abrupt, nonlinear transitions. This leads to systematic residual errors, manifesting as the under-prediction of sudden spikes or the over-smoothing of declines. Conversely, Deep Learning paradigms, specifically Long Short-Term Memory (LSTM) networks, demonstrate exceptional efficacy in handling intricate time-series data. By utilizing memory gates, LSTMs learn complex nonlinear dependencies. Yet, LSTMs face instability in open-loop forecasting; without ground truth feedback, minor deviations compound recursively, causing divergence. To resolve these limitations, we propose a Hybrid SARIMA-LSTM architecture. This framework employs a residual-learning strategy to decompose temperature into a predictable climate component and a nonlinear weather component. The SARIMA unit models the robust, long-term seasonal trend, while the LSTM is trained exclusively on the residuals, the nonlinear errors SARIMA fails to capture. By fusing statistical stability with neural plasticity, this hybrid approach minimizes error propagation and enhances long-horizon accuracy.
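The residual-learning decomposition can be illustrated without the full SARIMA and LSTM machinery. In this sketch a per-phase seasonal mean stands in for the SARIMA unit; the synthetic temperature series, the stand-in model, and the period are assumptions made purely to show the split into climate and weather components.

```python
import numpy as np

def seasonal_fit(y, period):
    """Stand-in for the SARIMA unit: per-phase seasonal means."""
    phases = np.arange(len(y)) % period
    means = np.array([y[phases == p].mean() for p in range(period)])
    return means[phases]                      # fitted seasonal path

# synthetic daily temperatures: annual cycle plus stochastic "weather" noise
rng = np.random.default_rng(42)
t = np.arange(3 * 365)
y = 15 + 10 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 2, t.size)

seasonal = seasonal_fit(y, 365)   # predictable climate component (SARIMA's job)
residual = y - seasonal           # nonlinear weather component (the LSTM's target)
```

In the hybrid scheme, the seasonal model forecasts the climate path, a recurrent model trained only on `residual` forecasts the leftover dynamics, and the final prediction is their sum, which is what limits recursive error compounding in open-loop rollouts.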
https://arxiv.org/abs/2601.07951
The European Space Agency (ESA), driven by its ambitions for planned lunar missions with the Argonaut lander, has a profound interest in reliable crater detection, since craters pose a risk to safe lunar landings. This task is usually addressed with automated crater detection algorithms (CDA) based on deep learning techniques. It is non-trivial due to the vast amount of craters of various sizes and shapes, as well as challenging conditions such as varying illumination and rugged terrain. Therefore, we propose a deep-learning CDA based on the OWLv2 model, which is built on a Vision Transformer, that has proven highly effective in various computer vision tasks. For fine-tuning, we utilize a manually labeled dataset from the IMPACT project, which provides crater annotations on high-resolution Lunar Reconnaissance Orbiter Camera Calibrated Data Record images. We insert trainable parameters using a parameter-efficient fine-tuning strategy with Low-Rank Adaptation, and optimize a combined loss function consisting of Complete Intersection over Union (CIoU) for localization and a contrastive loss for classification. We achieve satisfactory visual results, along with a maximum recall of 94.0% and a maximum precision of 73.1% on a test dataset from IMPACT. Our method achieves reliable crater detection across challenging lunar imaging conditions, paving the way for robust crater analysis in future lunar exploration.
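The CIoU localization term follows a standard published formulation: IoU penalized by the normalized squared centre distance and an aspect-ratio consistency term. The sketch below is a plain-Python version of that general formula, not the authors' training code, and the `eps` stabilizer is an assumption.

```python
import math

def ciou_loss(box_p, box_g, eps=1e-9):
    """Complete IoU loss for predicted/ground-truth (x1, y1, x2, y2) boxes."""
    x1, y1, x2, y2 = box_p
    g1, h1, g2, h2 = box_g
    # intersection over union
    iw = max(0.0, min(x2, g2) - max(x1, g1))
    ih = max(0.0, min(y2, h2) - max(y1, h1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (g2 - g1) * (h2 - h1) - inter
    iou = inter / (union + eps)
    # squared centre distance over squared enclosing-box diagonal
    rho2 = ((x1 + x2 - g1 - g2) ** 2 + (y1 + y2 - h1 - h2) ** 2) / 4.0
    cw = max(x2, g2) - min(x1, g1)
    ch = max(y2, h2) - min(y1, h1)
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan((g2 - g1) / (h2 - h1 + eps))
                              - math.atan((x2 - x1) / (y2 - y1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)
```

Unlike plain IoU loss, the centre-distance term still yields a gradient for non-overlapping boxes, which matters for small craters that an anchor may initially miss entirely.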
https://arxiv.org/abs/2601.07795
Hyperspectral image (HSI) classification presents unique challenges due to its high spectral dimensionality and limited labeled data. Traditional deep learning models often suffer from overfitting and high computational costs. Self-distillation (SD), a variant of knowledge distillation where a network learns from its own predictions, has recently emerged as a promising strategy to enhance model performance without requiring external teacher networks. In this work, we explore the application of SD to HSI by treating earlier outputs as soft targets, thereby enforcing consistency between intermediate and final predictions. This process improves intra-class compactness and inter-class separability in the learned feature space. Our approach is validated on two benchmark HSI datasets and demonstrates significant improvements in classification accuracy and robustness, highlighting the effectiveness of SD for spectral-spatial learning. Codes are available at this https URL.
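The consistency objective described here can be sketched as a cross-entropy term plus a temperature-scaled KL term that pulls one head's prediction toward the other's softened output. Following the abstract's phrasing, the earlier output serves as the soft target below, though the direction of distillation, temperature `T`, and weight `lam` are design choices assumed for this sketch.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distill_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Cross-entropy on the student head plus KL consistency toward the
    other head's softened predictions (teacher is detached in practice)."""
    labels = np.asarray(labels)
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(ce + lam * (T ** 2) * kl)
```

When the two heads agree, the KL term vanishes and only supervised cross-entropy remains; the gradient of the KL term is what tightens intra-class clusters and separates classes in the shared feature space.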
https://arxiv.org/abs/2601.07416
Attention mechanisms have become a core component of deep learning models, with Channel Attention and Spatial Attention being the two most representative architectures. Current research on their fusion strategies primarily bifurcates into sequential and parallel paradigms, yet the selection process remains largely empirical, lacking systematic analysis and unified principles. We systematically compare channel-spatial attention combinations under a unified framework, building an evaluation suite of 18 topologies across four classes: sequential, parallel, multi-scale, and residual. Across two vision and nine medical datasets, we uncover a "data scale-method-performance" coupling law: (1) in few-shot tasks, the "Channel-Multi-scale Spatial" cascaded structure achieves optimal performance; (2) in medium-scale tasks, parallel learnable fusion architectures demonstrate superior results; (3) in large-scale tasks, parallel structures with dynamic gating yield the best performance. Additionally, experiments indicate that the "Spatial-Channel" order is more stable and effective for fine-grained classification, while residual connections mitigate vanishing gradient problems across varying data scales. We thus propose scenario-based guidelines for building future attention modules. Code is open-sourced at this https URL.
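Two of the compared topologies, the "Channel then Spatial" cascade and a learnable-weight parallel fusion, can be sketched with pooling-based gates. The learned projections inside real channel/spatial attention blocks (SE, CBAM, and their variants) are omitted here; the simplified gating functions and the fixed `alpha` are stand-ins for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """SE-style channel gate on a (C, H, W) map: global average pooling
    followed by a sigmoid (the learned MLP is omitted in this sketch)."""
    w = sigmoid(x.mean(axis=(1, 2)))            # (C,)
    return x * w[:, None, None]

def spatial_attention(x):
    """CBAM-style spatial gate from channel-pooled statistics."""
    s = (x.mean(axis=0) + x.max(axis=0)) / 2.0  # (H, W)
    return x * sigmoid(s)[None, :, :]

def sequential_cs(x):
    """Sequential topology: Channel gate, then Spatial gate."""
    return spatial_attention(channel_attention(x))

def parallel_fuse(x, alpha=0.5):
    """Parallel topology with a (here fixed) learnable fusion weight."""
    return alpha * channel_attention(x) + (1 - alpha) * spatial_attention(x)
```

In the full study the fusion weight (or a dynamic gate producing it) is learned, which is precisely what separates the medium-scale and large-scale winning configurations reported above.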
https://arxiv.org/abs/2601.07310