$\textbf{Objective:}$ Brain-predicted age difference (BrainAGE) is a neuroimaging biomarker reflecting brain health. However, training robust BrainAGE models requires large datasets, often restricted by privacy concerns. This study evaluates the performance of federated learning (FL) for BrainAGE estimation in ischemic stroke patients treated with mechanical thrombectomy, and investigates its association with clinical phenotypes and functional outcomes. $\textbf{Methods:}$ We used FLAIR brain images from 1674 stroke patients across 16 hospital centers. We implemented standard machine learning and deep learning models for BrainAGE estimation under three data management strategies: centralized learning (pooled data), FL (local training at each site), and single-site learning. We reported prediction errors and examined associations between BrainAGE and vascular risk factors (e.g., diabetes mellitus, hypertension, smoking), as well as functional outcomes at three months post-stroke. Logistic regression evaluated BrainAGE's predictive value for these outcomes, adjusting for age, sex, vascular risk factors, stroke severity, time between MRI and arterial puncture, prior intravenous thrombolysis, and recanalization outcome. $\textbf{Results:}$ While centralized learning yielded the most accurate predictions, FL consistently outperformed single-site models. BrainAGE was significantly higher in patients with diabetes mellitus across all models. Comparisons between patients with good and poor functional outcomes, together with multivariate prediction of these outcomes, showed a significant association between BrainAGE and post-stroke recovery. $\textbf{Conclusion:}$ FL enables accurate age predictions without data centralization. The strong association between BrainAGE, vascular risk factors, and post-stroke recovery highlights its potential for prognostic modeling in stroke care.
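The FL strategy described in this abstract, local training at each hospital with only model parameters leaving the site, reduces in its simplest form to federated averaging. The paper does not specify its aggregation rule; the sketch below is a minimal, hypothetical illustration in which each site's parameters are weighted by its patient count.

```python
import numpy as np

def fedavg(site_weights, site_sizes):
    """Federated averaging: combine per-site model parameters into a
    global model, weighting each site by its number of samples."""
    total = sum(site_sizes)
    return [
        sum(w[k] * (n / total) for w, n in zip(site_weights, site_sizes))
        for k in range(len(site_weights[0]))
    ]

# Toy example: two sites, each holding one weight matrix and one bias vector.
site_a = [np.ones((2, 2)), np.zeros(2)]
site_b = [3 * np.ones((2, 2)), np.ones(2)]
global_model = fedavg([site_a, site_b], site_sizes=[100, 300])
# Site B holds 75% of the data, so its parameters dominate the average.
```

In a full FL round this averaging step alternates with local gradient updates at each of the 16 centers; no FLAIR image ever leaves a hospital.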
https://arxiv.org/abs/2506.15626
Indoor localization using deep learning (DL) has demonstrated strong accuracy in mapping Wi-Fi RSS fingerprints to physical locations; however, most existing DL frameworks function as black-box models, offering limited insight into how predictions are made or how models respond to real-world noise over time. This lack of interpretability hampers our ability to understand the impact of temporal variations caused by environmental dynamics, and to adapt models for long-term reliability. To address this, we introduce LogNet, a novel logic gate-based framework designed to interpret and enhance DL-based indoor localization. LogNet enables transparent reasoning by identifying which access points (APs) are most influential for each reference point (RP) and reveals how environmental noise disrupts DL-driven localization decisions. This interpretability allows us to trace and diagnose model failures and adapt DL systems for more stable long-term deployments. Evaluations across multiple real-world building floorplans and over two years of temporal variation show that LogNet not only interprets the internal behavior of DL models but also improves performance, achieving 1.1x to 2.8x lower localization error, 3.4x to 43.3x smaller model size, and 1.5x to 3.6x lower latency compared to prior DL-based models.
https://arxiv.org/abs/2506.15559
Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM (\textbf{C}linically-Guided \textbf{L}GE \textbf{A}ugmentation for Real\textbf{i}stic and Diverse \textbf{M}yocardial Scar Synthesis and Segmentation), a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, aiming to enhance both the realism of synthesized scars and the accuracy of scar segmentation. Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks.
https://arxiv.org/abs/2506.15549
Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at this https URL.
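The core of the abstract's approach, sparsify and smooth attribution maps over randomized inputs so that per-pixel importance becomes stable, can be illustrated with a small sketch. This is not the paper's certification procedure (which derives formal $\ell_2$ guarantees); it only shows the smoothing-and-voting step, with all names and parameter choices hypothetical.

```python
import numpy as np

def smoothed_attribution(attribute, x, sigma=0.25, n_samples=100, top_k=10, seed=0):
    """Smooth a black-box attribution method: sparsify each map to its
    top-k pixels on Gaussian-perturbed copies of the input, then record
    how often each pixel survives the sparsification."""
    rng = np.random.default_rng(seed)
    votes = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        a = attribute(x + sigma * rng.standard_normal(x.shape))
        mask = np.zeros_like(a)
        mask[np.unravel_index(np.argsort(a, axis=None)[-top_k:], a.shape)] = 1.0
        votes += mask
    return votes / n_samples  # per-pixel vote frequency in [0, 1]

# Toy attribution: importance equals a fixed signal plus a tiny input-dependent term.
signal = np.zeros((8, 8)); signal[2:4, 2:4] = 5.0
freq = smoothed_attribution(lambda z: signal + 0.01 * z, signal)
# Genuinely important pixels keep a vote frequency near 1 despite perturbations.
```

In the paper, thresholding such vote frequencies is what turns attribution into a segmentation problem amenable to randomized-smoothing certificates.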
https://arxiv.org/abs/2506.15499
Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model's reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample's deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. Our framework also enables explainable OOD detection. We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods.
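The NERO scoring idea, one relevance centroid per in-distribution class and a distance-to-nearest-centroid OOD score, can be sketched compactly. The snippet below omits the paper's refinements (scaled bias relevance, feature norms) and uses simple class means as centroids; all names are illustrative.

```python
import numpy as np

def fit_centroids(relevances, labels):
    """One centroid per in-distribution (ID) class: the mean neuron-level
    relevance vector over that class's training samples."""
    return {c: relevances[labels == c].mean(axis=0) for c in np.unique(labels)}

def nero_score(centroids, r):
    """OOD score: distance from a sample's relevance vector to the
    nearest class centroid (higher = more OOD-like)."""
    return min(np.linalg.norm(r - mu) for mu in centroids.values())

rng = np.random.default_rng(0)
rel = np.vstack([rng.normal(0, 0.1, (50, 16)), rng.normal(3, 0.1, (50, 16))])
lab = np.array([0] * 50 + [1] * 50)
cents = fit_centroids(rel, lab)
id_score = nero_score(cents, rel[0])            # close to the class-0 centroid
ood_score = nero_score(cents, np.full(16, 10))  # far from both centroids
```

Because the score lives in relevance space rather than logit space, flagged samples can be explained by inspecting which neurons drove the distance.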
https://arxiv.org/abs/2506.15404
In the semiconductor sector, where demand is high but competition is strong and increasing, time to market and quality are key factors in securing significant market share in various application areas. Thanks to the success of deep learning methods in the computer vision domain in recent years, Industry 4.0 and 5.0 applications, such as defect classification, have achieved remarkable results. In particular, Domain Adaptation (DA) has proven highly effective, since it focuses on using the knowledge learned on a (source) domain to adapt and perform effectively on a different but related (target) domain. By improving robustness and scalability, DA minimizes the need for extensive manual re-labeling or re-training of models. This not only reduces computational and resource costs but also allows human experts to focus on high-value tasks. Therefore, we tested the efficacy of DA techniques in semi-supervised and unsupervised settings within the context of the semiconductor field. Moreover, we propose the DBACS approach, a CycleGAN-inspired model enhanced with additional loss terms to improve performance. All the approaches are studied and validated on real-world Electron Microscope images in unsupervised and semi-supervised settings, proving the usefulness of our method in advancing DA techniques for the semiconductor field.
https://arxiv.org/abs/2506.15260
Augmented reality (AR) offers immersive interaction but remains inaccessible for users with motor impairments or limited dexterity due to reliance on precise input methods. This study proposes a gesture-based interaction system for AR environments, leveraging deep learning to recognize hand and body gestures from wearable sensors and cameras, adapting interfaces to user capabilities. The system employs vision transformers (ViTs), temporal convolutional networks (TCNs), and graph attention networks (GATs) for gesture processing, with federated learning ensuring privacy-preserving model training across diverse users. Reinforcement learning optimizes interface elements like menu layouts and interaction modes. Experiments demonstrate a 20% improvement in task completion efficiency and a 25% increase in user satisfaction for motor-impaired users compared to baseline AR systems. This approach enhances AR accessibility and scalability. Keywords: Deep learning, Federated learning, Gesture recognition, Augmented reality, Accessibility, Human-computer interaction
https://arxiv.org/abs/2506.15189
Multi-parametric magnetic resonance imaging (mpMRI) exams have various series types acquired with different imaging protocols. The DICOM headers of these series often contain incorrect information due to the sheer diversity of protocols and occasional technologist errors. To address this, we present a deep learning-based classification model for 8 different body mpMRI series types so that radiologists can read exams efficiently. Using mpMRI data from various institutions, multiple deep learning-based classifiers built on ResNet, EfficientNet, and DenseNet are trained to classify the 8 MRI series types, and their performance is compared. The best-performing classifier is then identified, and its classification capability under different training data quantities is studied. The model is also evaluated on out-of-training-distribution datasets. Moreover, the model is trained using mpMRI exams obtained from different scanners under two training strategies, and its performance is tested. Experimental results show that the DenseNet-121 model achieves the highest F1-score and accuracy of 0.966 and 0.972 over the other classification models with p-value$<$0.05. The model shows greater than 0.95 accuracy when trained with over 729 studies, and its performance improves as the amount of training data grows. On the external DLDS and CPTAC-UCEC datasets, the model yields 0.872 and 0.810 accuracy, respectively. These results indicate that on both the internal and external datasets, the DenseNet-121 model attains high accuracy for the task of classifying 8 body MRI series types.
https://arxiv.org/abs/2506.15182
Speech enhancement, particularly denoising, is vital in improving the intelligibility and quality of speech signals for real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical research gap in their comparative performance evaluation. This study benchmarks three state-of-the-art models, Wave-U-Net, CMGAN, and U-Net, on the SpEAR, VPQAD, and Clarkson datasets. These models were chosen due to their relevance in the literature and code accessibility. The evaluation reveals that U-Net achieves high noise suppression with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN outperforms in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well-suited for applications prioritizing natural and intelligible speech. Wave-U-Net balances these attributes with improvements in speaker-specific feature retention, evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. This research indicates how advanced methods can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.
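The abstract reports percentage SNR improvements without stating the exact definition; one plausible reading, shown below as a hedged sketch, is the relative change in dB-scale SNR between the noisy input and the enhanced output (the `enhanced` signal here is a synthetic stand-in for a denoiser's output, not any of the benchmarked models).

```python
import numpy as np

def snr_db(clean, signal):
    """SNR in dB of `signal` measured against the clean reference."""
    noise = signal - clean
    return 10 * np.log10(np.sum(clean**2) / np.sum(noise**2))

def snr_improvement_pct(clean, noisy, enhanced):
    """Percentage SNR improvement of the enhanced signal over the noisy input."""
    before, after = snr_db(clean, noisy), snr_db(clean, enhanced)
    return 100 * (after - before) / abs(before)

t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 220 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.5 * rng.standard_normal(t.size)
enhanced = clean + 0.1 * rng.standard_normal(t.size)  # stand-in for a denoiser output
gain = snr_improvement_pct(clean, noisy, enhanced)
```

Percentage gains defined this way depend strongly on the baseline SNR, which is one reason the Clarkson figure (+364.2%) can dwarf the others.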
https://arxiv.org/abs/2506.15000
The integration of multi-modal Magnetic Resonance Imaging (MRI) and clinical data holds great promise for enhancing the diagnosis of neurological disorders (NDs) in real-world clinical settings. Deep Learning (DL) has recently emerged as a powerful tool for extracting meaningful patterns from medical data to aid in diagnosis. However, existing DL approaches struggle to effectively leverage multi-modal MRI and clinical data, leading to suboptimal performance. To address this challenge, we utilize a unique, proprietary multi-modal clinical dataset curated for ND research. Based on this dataset, we propose a novel transformer-based Mixture-of-Experts (MoE) framework for ND classification, leveraging anatomical MRI (aMRI), Diffusion Tensor Imaging (DTI), and functional MRI (fMRI) alongside clinical assessments. Our framework employs transformer encoders to capture spatial relationships within volumetric MRI data while utilizing modality-specific experts for targeted feature extraction. A gating mechanism with adaptive fusion dynamically integrates expert outputs, ensuring optimal predictive performance. Comprehensive experiments and comparisons with multiple baselines demonstrate that our multi-modal approach significantly enhances diagnostic accuracy, particularly in distinguishing overlapping disease states. Our framework achieves a validation accuracy of 82.47\%, outperforming baseline methods by over 10\%, highlighting its potential to improve ND diagnosis by applying multi-modal learning to real-world clinical data.
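The gating-with-adaptive-fusion step can be sketched numerically: a learned gate produces per-sample weights over the modality experts, and the fused representation is the weighted sum of expert outputs. This is a generic MoE fusion sketch, not the paper's architecture; shapes and names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_fusion(expert_outputs, gate_logits):
    """Adaptive fusion: the gate assigns each sample a weight per modality
    expert; the fused feature is the weighted sum of expert outputs.
    expert_outputs: (n_experts, batch, dim); gate_logits: (batch, n_experts)."""
    gates = softmax(gate_logits)  # rows sum to 1
    return np.einsum("bk,kbd->bd", gates, expert_outputs), gates

# Three modality experts (e.g., aMRI, DTI, fMRI) emitting constant toy features.
experts = np.stack([np.full((2, 4), v) for v in (1.0, 2.0, 3.0)])
logits = np.array([[10.0, 0.0, 0.0],   # sample 0: gate trusts expert 0
                   [0.0, 0.0, 10.0]])  # sample 1: gate trusts expert 2
fused, gates = gated_fusion(experts, logits)
```

Because the gate is input-dependent, each patient can lean on whichever modality is most informative for their case, which is the intuition behind the reported gains on overlapping disease states.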
https://arxiv.org/abs/2506.14970
While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) for structured numerical data is still not well explored. In this study, we test the effectiveness of LLMs in predicting diabetes using zero-shot, one-shot, and three-shot prompting methods. We conduct an empirical analysis using the Pima Indian Diabetes Database (PIDD). We evaluate six LLMs, including four open-source models: Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B. We also test two proprietary models: GPT-4o and Gemini Flash 2.0. In addition, we compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM). We use accuracy, precision, recall, and F1-score as evaluation metrics. Our results show that proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings. Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score. However, there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning. This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions.
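Few-shot prompting over structured rows amounts to serializing each record as text and prepending labeled examples. The sketch below builds a hypothetical three-shot prompt from PIDD-style rows; the feature subset and wording are illustrative, not the paper's actual template.

```python
def build_prompt(examples, query, n_shots=3):
    """Assemble a k-shot prompt that turns tabular PIDD-style rows into text.
    Feature names follow the Pima dataset; the phrasing is a hypothetical
    template, not the one used in the study."""
    features = ["Pregnancies", "Glucose", "BloodPressure", "BMI", "Age"]

    def row_to_text(row):
        return ", ".join(f"{k}={v}" for k, v in zip(features, row))

    lines = ["Predict whether the patient has diabetes (yes/no)."]
    for row, label in examples[:n_shots]:
        lines.append(f"Patient: {row_to_text(row)} -> {'yes' if label else 'no'}")
    lines.append(f"Patient: {row_to_text(query)} ->")  # model completes this line
    return "\n".join(lines)

shots = [([6, 148, 72, 33.6, 50], 1),
         ([1, 85, 66, 26.6, 31], 0),
         ([8, 183, 64, 23.3, 32], 1)]
prompt = build_prompt(shots, query=[2, 120, 70, 30.1, 29])
```

Zero-, one-, and three-shot variants differ only in `n_shots`, which makes the reported performance variation across prompting strategies easy to study systematically.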
https://arxiv.org/abs/2506.14949
Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.
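Building a jet-view image from detector-level deposits is essentially a weighted 2-D histogram over the (eta, phi) plane around the jet axis. The sketch below shows one channel; the paper's inputs stack ECAL, HCAL, and track channels the same way, and the bin count and extent here are hypothetical.

```python
import numpy as np

def jet_image(eta, phi, energy, bins=32, extent=0.8):
    """Histogram calorimeter deposits into a single-channel jet-view image
    centered on the jet axis, weighting each hit by its energy."""
    H, _, _ = np.histogram2d(
        eta, phi, bins=bins,
        range=[[-extent, extent], [-extent, extent]],
        weights=energy)
    return H

# Toy jet: 200 hits clustered around the axis with exponential energies.
rng = np.random.default_rng(0)
eta, phi = rng.normal(0, 0.2, 200), rng.normal(0, 0.2, 200)
energy = rng.exponential(1.0, 200)
img = jet_image(eta, phi, energy)
```

Gluon jets tend to produce wider, more diffuse images than quark jets, which is the spatial structure the ViT's global attention is meant to exploit.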
https://arxiv.org/abs/2506.14934
Microearthquakes (MEQs) generated by subsurface fluid injection record the evolving stress state and permeability of reservoirs. Forecasting their full spatiotemporal evolution is therefore critical for enhanced geothermal systems (EGS), CO$_2$ sequestration, and other geo-engineering applications. We present a transformer-based deep learning model that ingests hydraulic stimulation history and prior MEQ observations to forecast four key quantities: cumulative MEQ count, cumulative logarithmic seismic moment, and the 50th- and 95th-percentile extents ($P_{50}, P_{95}$) of the MEQ cloud. Applied to the EGS Collab Experiment 1 dataset, the model achieves $R^2 >0.98$ for the 1-second forecast horizon and $R^2 >0.88$ for the 15-second forecast horizon across all targets, and supplies uncertainty estimates through a learned standard deviation term. These accurate, uncertainty-quantified forecasts enable real-time inference of fracture propagation and permeability evolution, demonstrating the strong potential of deep-learning approaches to improve seismic-risk assessment and guide mitigation strategies in future fluid-injection operations.
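Fixed-horizon forecasting of a quantity like the cumulative MEQ count is typically framed as supervised learning over sliding windows: a history window as input, the value at `t + horizon` as the target. A minimal sketch of that data preparation (window lengths and the toy series are hypothetical, not the paper's settings):

```python
import numpy as np

def make_windows(series, history, horizon):
    """Turn a 1-D time series into (history window, value at t+horizon)
    training pairs, the supervised framing behind fixed-horizon forecasts."""
    X, y = [], []
    for t in range(history, len(series) - horizon + 1):
        X.append(series[t - history:t])
        y.append(series[t + horizon - 1])
    return np.array(X), np.array(y)

# Toy cumulative MEQ count: a monotonically increasing step series.
counts = np.cumsum([0, 1, 0, 2, 1, 3, 0, 2, 1, 1])
X, y = make_windows(counts, history=4, horizon=2)
```

In the paper the input windows additionally carry hydraulic stimulation history, and the network emits a mean and a learned standard deviation per target rather than a point estimate.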
https://arxiv.org/abs/2506.14923
Cone-beam X-ray computed tomography (XCT) is an essential imaging technique for generating 3D reconstructions of internal structures, with applications ranging from medical to industrial imaging. Producing high-quality reconstructions typically requires many X-ray measurements; this process can be slow and expensive, especially for dense materials. Recent work incorporating artifact reduction priors within a plug-and-play (PnP) reconstruction framework has shown promising results in improving image quality from sparse-view XCT scans while enhancing the generalizability of deep learning-based solutions. However, this method uses a 2D convolutional neural network (CNN) for artifact reduction, which captures only slice-independent information from the 3D reconstruction, limiting performance. In this paper, we propose a PnP reconstruction method that uses a 2.5D artifact reduction CNN as the prior. This approach leverages inter-slice information from adjacent slices, capturing richer spatial context while remaining computationally efficient. We show that this 2.5D prior not only improves the quality of reconstructions but also enables the model to directly suppress commonly occurring XCT artifacts (such as beam hardening), eliminating the need for artifact correction pre-processing. Experiments on both experimental and synthetic cone-beam XCT data demonstrate that the proposed method better preserves fine structural details, such as pore size and shape, leading to more accurate defect detection compared to 2D priors. In particular, we demonstrate strong performance on experimental XCT data using a 2.5D artifact reduction prior trained entirely on simulated scans, highlighting the proposed method's ability to generalize across domains.
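The difference between a 2D and a 2.5D prior is in how the CNN input is assembled: the 2.5D variant feeds each slice together with its neighbors as extra channels, so the denoiser sees inter-slice context. A minimal sketch of that input construction (the neighbor count here is illustrative):

```python
import numpy as np

def make_25d_input(volume, i, n_adjacent=1):
    """Build the 2.5D network input for slice i: the slice plus its
    n_adjacent neighbors on each side, stacked as channels. Indices are
    clamped at the volume boundary so the channel count stays constant."""
    idx = np.clip(np.arange(i - n_adjacent, i + n_adjacent + 1),
                  0, volume.shape[0] - 1)
    return volume[idx]  # shape: (2*n_adjacent + 1, H, W)

vol = np.arange(5 * 4 * 4).reshape(5, 4, 4).astype(float)
mid = make_25d_input(vol, i=2)   # slices 1, 2, 3 as channels
edge = make_25d_input(vol, i=0)  # slice 0 duplicated at the boundary
```

Because the network still predicts one output slice at a time, this keeps 2D-level memory cost while capturing the richer spatial context the paper credits for better artifact suppression.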
https://arxiv.org/abs/2506.14719
Density Functional Theory (DFT) is the most widely used electronic structure method for predicting the properties of molecules and materials. Although DFT is, in principle, an exact reformulation of the Schrödinger equation, practical applications rely on approximations to the unknown exchange-correlation (XC) functional. Most existing XC functionals are constructed using a limited set of increasingly complex, hand-crafted features that improve accuracy at the expense of computational efficiency. Yet, no current approximation achieves the accuracy and generality for predictive modeling of laboratory experiments at chemical accuracy -- typically defined as errors below 1 kcal/mol. In this work, we present Skala, a modern deep learning-based XC functional that bypasses expensive hand-designed features by learning representations directly from data. Skala achieves chemical accuracy for atomization energies of small molecules while retaining the computational efficiency typical of semi-local DFT. This performance is enabled by training on an unprecedented volume of high-accuracy reference data generated using computationally intensive wavefunction-based methods. Notably, Skala systematically improves with additional training data covering diverse chemistry. By incorporating a modest amount of additional high-accuracy data tailored to chemistry beyond atomization energies, Skala achieves accuracy competitive with the best-performing hybrid functionals across general main group chemistry, at the computational cost of semi-local DFT. As the training dataset continues to expand, Skala is poised to further enhance the predictive power of first-principles simulations.
https://arxiv.org/abs/2506.14665
Over the last decade, as we rely more on deep learning technologies to make critical decisions, concerns regarding their safety, reliability and interpretability have emerged. We introduce a novel Neural Argumentative Learning (NAL) architecture that integrates Assumption-Based Argumentation (ABA) with deep learning for image analysis. Our architecture consists of neural and symbolic components. The former segments and encodes images into facts using object-centric learning, while the latter applies ABA learning to develop ABA frameworks enabling predictions with images. Experiments on synthetic data show that the NAL architecture can be competitive with a state-of-the-art alternative.
https://arxiv.org/abs/2506.14577
Holographic displays have significant potential in virtual reality and augmented reality owing to their ability to provide all the depth cues. Deep learning-based methods play an important role in computer-generated holograms (CGH). During the diffraction process, each pixel exerts an influence on the reconstructed image. However, previous works face challenges in capturing sufficient information to accurately model this process, primarily due to the inadequacy of their effective receptive field (ERF). Here, we design a complex-valued deformable convolution for integration into the network, enabling dynamic adjustment of the convolution kernel's shape to increase the flexibility of the ERF for better feature extraction. This approach allows us to utilize a single model while achieving state-of-the-art performance in both simulated and optical experiment reconstructions, surpassing existing open-source models. Specifically, our method has a peak signal-to-noise ratio that is 2.04 dB, 5.31 dB, and 9.71 dB higher than that of CCNN-CGH, HoloNet, and Holo-encoder, respectively, at a resolution of 1920$\times$1072. The number of parameters of our model is only about one-eighth of that of CCNN-CGH.
https://arxiv.org/abs/2506.14542
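The core idea, complex-valued deformable convolution, can be sketched minimally: each kernel tap carries a learned fractional offset, and the complex field is bilinearly resampled at the shifted positions. This NumPy toy handles a single channel with a fixed 3$\times$3 kernel; the paper's trained, multi-channel implementation is of course more elaborate.

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Bilinearly interpolate a complex-valued 2-D field at fractional (y, x)."""
    H, W = img.shape
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x1]
            + dy * (1 - dx) * img[y1, x0] + dy * dx * img[y1, x1])

def complex_deformable_conv3x3(field, weights, offsets):
    """Single-channel complex deformable 3x3 convolution.
    field:   (H, W) complex array, e.g. a propagated wavefront
    weights: (3, 3) complex kernel
    offsets: (H, W, 3, 3, 2) learned fractional (dy, dx) shifts per tap,
             which reshape the kernel's sampling grid (the flexible ERF)
    """
    H, W = field.shape
    out = np.zeros((H, W), dtype=complex)
    for i in range(H):
        for j in range(W):
            acc = 0j
            for ky in range(3):
                for kx in range(3):
                    oy, ox = offsets[i, j, ky, kx]
                    acc += weights[ky, kx] * bilinear_sample(
                        field, i + ky - 1 + oy, j + kx - 1 + ox)
            out[i, j] = acc
    return out
```

With all offsets zero this reduces to an ordinary complex convolution; non-zero offsets deform the sampling grid, which is what enlarges the effective receptive field without growing the kernel.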
This paper introduces a novel neural network framework called M2BeamLLM for beam prediction in millimeter-wave (mmWave) massive multi-input multi-output (mMIMO) communication systems. M2BeamLLM integrates multi-modal sensor data, including images, radar, LiDAR, and GPS, leveraging the powerful reasoning capabilities of large language models (LLMs) such as GPT-2 for beam prediction. By combining sensing data encoding, multimodal alignment and fusion, and supervised fine-tuning (SFT), M2BeamLLM achieves significantly higher beam prediction accuracy and robustness, demonstrably outperforming traditional deep learning (DL) models in both standard and few-shot scenarios. Furthermore, its prediction performance consistently improves with increased diversity in sensing modalities. Our study provides an efficient and intelligent beam prediction solution for vehicle-to-infrastructure (V2I) mmWave communication systems.
https://arxiv.org/abs/2506.14532
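The "multimodal alignment and fusion" step can be pictured as projecting each modality into a shared embedding space and concatenating the results before a prediction head scores candidate beams. The dimensions, the linear head, and the zero-filling of missing modalities below are all assumptions for illustration; M2BeamLLM feeds the fused representation to a GPT-2 backbone rather than a linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality encoders: project raw features into a shared
# embedding space (the "alignment" step). Dimensions are made up.
DIMS = {"image": 64, "radar": 32, "lidar": 48, "gps": 2}
EMB = 16
proj = {m: rng.standard_normal((d, EMB)) / np.sqrt(d) for m, d in DIMS.items()}

def fuse(sample):
    """Encode each available modality and concatenate the embeddings.
    A missing modality contributes zeros, so prediction degrades
    gracefully instead of failing."""
    parts = []
    for m in DIMS:
        x = sample.get(m)
        parts.append(x @ proj[m] if x is not None else np.zeros(EMB))
    return np.concatenate(parts)  # shape (4 * EMB,)

def predict_beam(sample, head):
    """Score each candidate beam with a linear head and pick the best."""
    logits = fuse(sample) @ head  # head: (4 * EMB, n_beams)
    return int(np.argmax(logits))
```

Dropping a modality from `sample` models the paper's observation in reverse: with fewer sensing modalities, the fused vector carries less information and prediction quality falls.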
Background: Accurate lesion segmentation is critical for multiple sclerosis (MS) diagnosis, yet current deep learning approaches face robustness challenges. Aim: This study improves MS lesion segmentation by combining data fusion and deep learning techniques. Materials and Methods: We proposed novel radiomic features (concentration rate and Rényi entropy) to characterize different MS lesion types and fused these with raw imaging data. The study integrated radiomic features with imaging data through ResNeXt-UNet and attention-augmented U-Net architectures. Our approach was evaluated on scans from 46 patients (1102 slices), comparing performance before and after data fusion. Results: The radiomics-enhanced ResNeXt-UNet demonstrated high segmentation accuracy, achieving significant improvements in precision and sensitivity over the MRI-only baseline and a Dice score of 0.774$\pm$0.05; p<0.001 according to Bonferroni-adjusted Wilcoxon signed-rank tests. The radiomics-enhanced attention-augmented U-Net model showed greater model stability, evidenced by reduced performance variability (SDD = 0.18 $\pm$ 0.09 vs. 0.21 $\pm$ 0.06; p=0.03) and smoother validation curves with radiomics integration. Conclusion: These results validate our hypothesis that fusing radiomics with raw imaging data boosts segmentation performance and stability in state-of-the-art models.
https://arxiv.org/abs/2506.14524
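The Rényi entropy feature has a standard closed form, $H_\alpha = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$, computed over a patch's intensity histogram. The "concentration rate" below is an assumed interpretation (fraction of intensity mass inside the lesion mask), since the abstract does not define it.

```python
import numpy as np

def renyi_entropy(patch, alpha=2.0, bins=32):
    """Renyi entropy of order alpha for an intensity patch:
        H_alpha = 1 / (1 - alpha) * log(sum_i p_i ** alpha)
    As alpha -> 1 this recovers Shannon entropy."""
    hist, _ = np.histogram(patch, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))  # Shannon limit
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def concentration_rate(patch, mask):
    """Assumed form of the 'concentration rate' feature: fraction of the
    patch's total intensity that falls inside the lesion mask."""
    total = patch.sum()
    return float(patch[mask].sum() / total) if total > 0 else 0.0
```

For a patch whose intensities spread uniformly over all histogram bins, the Rényi entropy reaches its maximum, $\log(\text{bins})$, regardless of $\alpha$; sharply peaked intensity distributions score near zero, which is what lets the feature separate lesion textures.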
Facial micro-expression recognition (MER) is a challenging problem due to the transient and subtle nature of micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames such as the onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework that combines the advantages of transformers, graph convolution, and vanilla convolution. In particular, we propose a novel F5C block composed of fully-connected convolution and channel correspondence convolution to directly extract local-global features from a sequence of raw frames, without prior knowledge of key frames. The transformer-style fully-connected convolution extracts local features while maintaining global receptive fields, and the graph-style channel correspondence convolution models the correlations among feature patterns. Moreover, MER, optical flow estimation, and facial landmark detection are jointly trained by sharing the local-global features. The latter two tasks contribute to capturing facial subtle action information for MER, which alleviates the impact of insufficient training data. Extensive experiments demonstrate that our framework (i) outperforms the state-of-the-art MER methods on the CASME II, SAMM, and SMIC benchmarks, (ii) works well for optical flow estimation and facial landmark detection, and (iii) can capture subtle facial muscle actions in local regions associated with MEs. The code is available at this https URL.
https://arxiv.org/abs/2506.14511