Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) either transform large models into smaller ones at the cost of significant performance degradation or require prolonged training of the smaller models to recover performance. To address these issues, we introduce an effective two-step, representation-learning-based approach capable of producing several small models from a single large model while ensuring considerably better performance within a limited number of epochs. Comprehensive experiments on ASR benchmarks demonstrate the efficacy of our approach, which achieves a three-fold training speed-up and up to a 12.54% word error rate improvement.
https://arxiv.org/abs/2505.16991
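The abstract above reports its gains as word error rate (WER). As a reminder of how that metric is commonly computed, the following is a minimal sketch: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` is illustrative and does not come from the paper, and the abstract does not say whether its 12.54% figure is a relative or an absolute change.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.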
Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making the method stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
https://arxiv.org/abs/2505.16932
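For orientation, here is a minimal sketch of the classical Newton-Schulz iteration that the Polar Express abstract above uses as its baseline. It is not Polar Express itself: the abstract says Polar Express adapts the polynomial at each iteration, whereas this sketch uses the fixed cubic update X ← 1.5·X − 0.5·X·Xᵀ·X. The helper names and the Frobenius-norm scaling are choices made for this example, and plain Python lists stand in for GPU tensors.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz_polar(m, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix m."""
    # Scale by the Frobenius norm so that every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        x_xtx = matmul(x, matmul(transpose(x), x))
        # Fixed cubic update: only matrix-matrix products, no inverse or QR.
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, x_xtx)]
    return x
```

Each step maps a singular value s to 1.5·s − 0.5·s³, so a very small singular value grows by only about 1.5× per step at first. This is the slow initial convergence the abstract refers to.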
Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we develop a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks.
https://arxiv.org/abs/2505.16811
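The deraining abstract above builds pseudo-clean patches from the sparsity prior of rain. One simple way to read that idea is a per-pixel temporal median over aligned frames: a streak covers any given pixel in only a few frames, so the median tends to fall on a rain-free value. The sketch below illustrates only this intuition. The function name `median_stack`, the assumption that frames are already aligned, and the use of nested lists for grayscale frames are all assumptions of this example, and the paper's actual median stacking loss is not specified in the abstract.

```python
from statistics import median

def median_stack(frames):
    """Per-pixel temporal median over equally sized, pre-aligned 2D frames."""
    if not frames:
        raise ValueError("at least one frame is required")
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(frame[y][x] for frame in frames) for x in range(width)]
            for y in range(height)]
```

For example, if one of three frames has a bright streak at a pixel (values 10, 200, 11), the stacked value at that pixel is 11 rather than 200.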
Deep learning has transformed computer vision but relies heavily on large labeled datasets and computational resources. Transfer learning, particularly fine-tuning pretrained models, offers a practical alternative; however, models pretrained on natural image datasets such as ImageNet may fail to capture domain-specific characteristics in medical imaging. This study introduces an unsupervised learning framework that extracts high-value dermatological features instead of relying solely on ImageNet-based pretraining. We employ a Variational Autoencoder (VAE) trained from scratch on a proprietary dermatological dataset, allowing the model to learn a structured and clinically relevant latent space. This self-supervised feature extractor is then compared to an ImageNet-pretrained backbone under identical classification conditions, highlighting the trade-offs between general-purpose and domain-specific pretraining. Our results reveal distinct learning patterns. The self-supervised model achieves a final validation loss of 0.110 (-33.33%), while the ImageNet-pretrained model stagnates at 0.100 (-16.67%), indicating overfitting. Accuracy trends confirm this: the self-supervised model improves from 45% to 65% (+44.44%) with a near-zero overfitting gap, whereas the ImageNet-pretrained model reaches 87% (+50.00%) but plateaus at 75% (+19.05%), with its overfitting gap increasing to +0.060. These findings suggest that while ImageNet pretraining accelerates convergence, it also amplifies overfitting on non-clinically relevant features. In contrast, self-supervised learning achieves steady improvements, stronger generalization, and superior adaptability, underscoring the importance of domain-specific feature extraction in medical imaging.
https://arxiv.org/abs/2505.16773
Batteries are essential for various applications, including electric vehicles and renewable energy storage, making safety and efficiency critical concerns. Anomaly detection in battery thermal images helps identify failures early, but traditional deep learning methods require extensive labeled data, which is difficult to obtain, especially for anomalies, because of safety risks and high data collection costs. To overcome this, we explore zero-shot anomaly detection using Visual Question Answering (VQA) models, which leverage pretrained knowledge and text-based prompts to generalize across vision tasks. By incorporating prior knowledge of normal battery thermal behavior, we design prompts to detect anomalies without battery-specific training data. We evaluate three VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2), analyzing their robustness to prompt variations, repeated trials, and qualitative outputs. Despite the lack of fine-tuning on battery data, our approach demonstrates competitive performance compared to state-of-the-art models trained on battery data. Our findings highlight the potential of VQA-based zero-shot learning for battery anomaly detection and suggest future directions for improving its effectiveness.
https://arxiv.org/abs/2505.16674
Accurate prediction of the Remaining Useful Life (RUL) is essential for enabling timely maintenance of lithium-ion batteries, impacting the operational efficiency of electric applications that rely on them. This paper proposes an RUL prediction approach that leverages data from recent charge-discharge cycles to estimate the number of remaining usable cycles. The approach introduces both a novel signal processing pipeline and a deep learning prediction model. In the signal preprocessing pipeline, a derived capacity feature is computed based on current and capacity signals. Alongside the original capacity, voltage, and current, these features are denoised and enhanced using statistical metrics and a delta-based method to capture differences between the current and previous cycles. In the prediction model, the processed features are then fed into a hybrid deep learning architecture composed of 1D Convolutional Neural Networks (CNN), Attentional Long Short-Term Memory (A-LSTM), and Ordinary Differential Equation-based LSTM (ODE-LSTM) modules. This architecture is designed to capture both local signal characteristics and long-range temporal dependencies while modeling the continuous-time dynamics of battery degradation. The model is further evaluated using transfer learning across different learning strategies and target data partitioning scenarios. Results indicate that the model maintains robust performance, even when fine-tuned on limited target data. Experimental results on two publicly available large-scale datasets demonstrate that the proposed method outperforms a baseline deep learning approach and machine learning techniques, achieving an RMSE of 101.59, highlighting its strong potential for real-world RUL prediction applications.
https://arxiv.org/abs/2505.16664
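Two ingredients named in the RUL abstract above are easy to make concrete: a delta feature that captures the change between the current and the previous cycle, and the RMSE metric used to report results. The sketch below is illustrative only. The function names are hypothetical, padding the first cycle's delta with 0.0 is an assumption of this example, and the paper's full denoising pipeline is not described in the abstract.

```python
import math

def delta_features(values):
    """Change of a per-cycle quantity relative to the previous cycle.

    The first cycle has no predecessor; this sketch pads it with 0.0.
    """
    if not values:
        return []
    return [0.0] + [curr - prev for prev, curr in zip(values, values[1:])]

def rmse(predicted, actual):
    """Root mean squared error between predicted and true remaining cycles."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))
```

RMSE is expressed in the same unit as the target, so for a model that predicts remaining cycles, an RMSE of 10 means a typical error of about ten cycles.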
Accurate and efficient quantification of cardiac function is essential for the estimation of prognosis of cardiovascular diseases (CVDs). One of the most commonly used metrics for evaluating cardiac pumping performance is left ventricular ejection fraction (LVEF). However, LVEF can be affected by factors such as inter-observer variability and varying pre-load and after-load conditions, which can reduce its reproducibility. Additionally, cardiac dysfunction may not always manifest as alterations in LVEF, such as in heart failure and cardiotoxicity diseases. An alternative measure that can provide a relatively load-independent quantitative assessment of myocardial contractility is myocardial strain and strain rate. By using LVEF in combination with myocardial strain, it is possible to obtain a thorough description of cardiac function. Automated estimation of LVEF and other volumetric measures from cine-MRI sequences can be achieved through segmentation models, while strain calculation requires the estimation of tissue displacement between sequential frames, which can be accomplished using registration models. These tasks are often performed separately, potentially limiting the assessment of cardiac function. To address this issue, in this study we propose an end-to-end deep learning (DL) model that jointly estimates groupwise (GW) registration and segmentation for cardiac cine-MRI images. The proposed anatomically-guided Deep GW network was trained and validated on a large dataset of 4-chamber view cine-MRI image series of 374 subjects. A quantitative comparison with conventional GW registration using elastix and two DL-based methods showed that the proposed model improved performance and substantially reduced computation time.
https://arxiv.org/abs/2505.16452
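The cardiac abstract above relies on left ventricular ejection fraction (LVEF), which is defined as (EDV − ESV) / EDV × 100, where EDV and ESV are the end-diastolic and end-systolic volumes. The following minimal sketch assumes that a per-frame volume curve has already been derived from segmentation masks. How volumes are obtained from a 4-chamber view is outside its scope, and the function name is illustrative.

```python
def ejection_fraction(volumes_ml):
    """LVEF in percent from left-ventricular volumes over one cardiac cycle.

    End-diastolic volume (EDV) is taken as the largest volume in the cycle,
    end-systolic volume (ESV) as the smallest.
    """
    if not volumes_ml:
        raise ValueError("at least one volume is required")
    edv = max(volumes_ml)
    esv = min(volumes_ml)
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv
```

For a cycle with volumes of 120, 100, 60, 50, 80, and 115 mL, EDV is 120 mL and ESV is 50 mL, giving an LVEF of about 58.3%.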
Autonomous vehicles are typical complex intelligent systems with artificial intelligence at their core. However, perception methods based on deep learning are extremely vulnerable to adversarial samples, which can result in safety accidents. Generating effective adversarial examples in the physical world and evaluating object detection systems remains a major challenge. In this study, we propose a unified joint adversarial training framework for both 2D and 3D samples to address the challenges of intra-class diversity and environmental variations in real-world scenarios. Building upon this framework, we introduce an adversarial sample reality enhancement approach that incorporates non-rigid surface modeling and a realistic 3D matching mechanism. We compare with 5 advanced adversarial patches and evaluate their attack performance on 8 object detectors, including single-stage, two-stage, and transformer-based models. Extensive experimental results in digital and physical environments demonstrate that the adversarial textures generated by our method can effectively mislead the target detection model. Moreover, the proposed method demonstrates excellent robustness and transferability under multi-angle attacks, varying lighting conditions, and different distances in the physical world. The demo video and code can be obtained at this https URL.
https://arxiv.org/abs/2505.16402
Quantum optimization is the most mature quantum computing technology to date, providing a promising approach towards efficiently solving complex combinatorial problems. Methods such as adiabatic quantum computing (AQC) have been employed in recent years on important optimization problems across various domains. In deep learning, deep neural networks (DNNs) have reached immense sizes to support new predictive capabilities. Optimization of large-scale models is critical for sustainable deployment, but becomes increasingly challenging with ever-growing model sizes and complexity. While quantum optimization is suitable for solving complex problems, its application to DNN optimization is not straightforward, requiring thorough reformulation for compatibility with commercially available quantum devices. In this work, we explore the potential of adopting AQC for fine-grained pruning-quantization of convolutional neural networks. We rework established heuristics to formulate model compression as a quadratic unconstrained binary optimization (QUBO) problem, and assess the solution space offered by commercial quantum annealing devices. Through our exploratory efforts of reformulation, we demonstrate that AQC can achieve effective compression of practical DNN models. Experiments demonstrate that AQC not only outperforms classical algorithms such as genetic algorithms and reinforcement learning in terms of time efficiency but also excels at identifying global optima.
https://arxiv.org/abs/2505.16332
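To make the QUBO framing in the abstract above concrete, here is a toy sketch that casts "keep exactly k of n weights, preferring the important ones" as a quadratic unconstrained binary optimization problem. The importance scores, the penalty weight, and the brute-force solver (a classical stand-in for a quantum annealer) are all assumptions of this example, and the paper's actual pruning-quantization formulation is not given in the abstract. With binary x_i, the constraint penalty·(Σx_i − k)² expands to diagonal terms penalty·(1 − 2k) and pairwise terms 2·penalty, plus a constant that can be dropped.

```python
from itertools import product

def build_pruning_qubo(importance, keep, penalty):
    """Upper-triangular QUBO matrix: reward kept importance, penalize sum(x) != keep."""
    n = len(importance)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Linear terms sit on the diagonal because x_i * x_i == x_i for binary x_i.
        q[i][i] = -importance[i] + penalty * (1 - 2 * keep)
        for j in range(i + 1, n):
            q[i][j] = 2.0 * penalty
    return q

def qubo_energy(q, x):
    """Energy x^T Q x for an upper-triangular Q and a binary assignment x."""
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))

def brute_force_solve(q):
    """Enumerate all 2^n assignments; only feasible for very small n."""
    return min(product((0, 1), repeat=len(q)), key=lambda x: qubo_energy(q, x))
```

The penalty must be large compared with the importance scores; otherwise the minimizer may prefer to violate the "keep exactly k" constraint.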
Deep learning models trained on extensive Electronic Health Records (EHR) data have achieved high accuracy in diagnosis prediction, offering the potential to assist clinicians in decision-making and treatment planning. However, these models lack two crucial features that clinicians highly value: interpretability and interactivity. The "black-box" nature of these models makes it difficult for clinicians to understand the reasoning behind predictions, limiting their ability to make informed decisions. Additionally, the absence of interactive mechanisms prevents clinicians from incorporating their own knowledge and experience into the decision-making process. To address these limitations, we propose II-KEA, a knowledge-enhanced agent-driven causal discovery framework that integrates personalized knowledge databases and agentic LLMs. II-KEA enhances interpretability through explicit reasoning and causal analysis, while also improving interactivity by allowing clinicians to inject their knowledge and experience through customized knowledge bases and prompts. II-KEA is evaluated on both MIMIC-III and MIMIC-IV, demonstrating superior performance along with enhanced interpretability and interactivity, as evidenced by its strong results from extensive case studies.
https://arxiv.org/abs/2505.16288
Social media and online forums are becoming increasingly popular. Unfortunately, these platforms are also being used to spread hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate speech detection models, thereby decreasing their efficiency. We also ensure a minimal change in the original meaning of the hate speech. Our best perturbation attack successfully evades hate speech detection for 86.8% of hateful text.
https://arxiv.org/abs/2505.16263
In this paper, we propose MADCluster, a novel model-agnostic anomaly detection framework utilizing self-supervised clustering. MADCluster is applicable to various deep learning architectures and addresses the 'hypersphere collapse' problem inherent in existing deep learning-based anomaly detection methods. The core idea is to cluster normal pattern data into a 'single cluster' while simultaneously learning the cluster center and mapping data close to this center. Also, to improve expressiveness and enable effective single clustering, we propose a new 'One-directed Adaptive loss'. The optimization of this loss is mathematically proven. MADCluster consists of three main components: Base Embedder capturing high-dimensional temporal dynamics, Cluster Distance Mapping, and Sequence-wise Clustering for continuous center updates. Its model-agnostic characteristics are achieved by applying various architectures to the Base Embedder. Experiments on four time series benchmark datasets demonstrate that applying MADCluster improves the overall performance of comparative models. In conclusion, the compatibility of MADCluster shows potential for enhancing model performance across various architectures.
https://arxiv.org/abs/2505.16223
Adversarial attacks exploiting unrestricted natural perturbations present severe security risks to deep learning systems, yet their transferability across models remains limited due to distribution mismatches between generated adversarial features and real-world data. While recent works utilize pre-trained diffusion models as adversarial priors, they still encounter challenges due to the shift between the distribution of ideal adversarial samples and the natural image distribution learned by the diffusion model. To address this challenge, we propose Transferable Robust Adversarial Images via Latent Diffusion (TRAIL), a test-time adaptation framework that enables the model to generate images from a distribution of images that carry adversarial features and closely resemble the target images. To mitigate the distribution shift, during attacks, TRAIL updates the diffusion U-Net's weights by combining adversarial objectives (to mislead victim models) and perceptual constraints (to preserve image realism). The adapted model then generates adversarial samples through iterative noise injection and denoising guided by these objectives. Experiments demonstrate that TRAIL significantly outperforms state-of-the-art methods in cross-model attack transferability, validating that distribution-aligned adversarial feature synthesis is critical for practical black-box attacks.
https://arxiv.org/abs/2505.16166
Ultra-high-definition (UHD) image restoration aims to specifically solve the problem of quality degradation in ultra-high-resolution images. Recent advancements in this field are predominantly driven by deep learning-based innovations, including enhancements in dataset construction, network architecture, sampling strategies, prior knowledge integration, and loss functions. In this paper, we systematically review recent progress in UHD image restoration, covering various aspects ranging from dataset construction to algorithm design. This serves as a valuable resource for understanding state-of-the-art developments in the field. We begin by summarizing degradation models for various image restoration subproblems, such as super-resolution, low-light enhancement, deblurring, dehazing, deraining, and desnowing, and emphasizing the unique challenges of their application to UHD image restoration. We then highlight existing UHD benchmark datasets and organize the literature according to degradation types and dataset construction methods. Following this, we showcase major milestones in deep learning-driven UHD image restoration, reviewing the progression of restoration tasks, technological developments, and evaluations of existing methods. We further propose a classification framework based on network architectures and sampling strategies, helping to clearly organize existing methods. Finally, we share insights into the current research landscape and propose directions for further advancements. A related repository is available at this https URL.
https://arxiv.org/abs/2505.16161
In recent years, deep learning-based Monocular Depth Estimation (MDE) models have been widely applied in fields such as autonomous driving and robotics. However, their vulnerability to backdoor attacks remains unexplored. To fill this gap, we conduct a comprehensive investigation of backdoor attacks against MDE models. Existing backdoor attack methods typically cannot be applied to MDE models because the label used in MDE takes the form of a depth map. To address this, we propose BadDepth, the first backdoor attack targeting MDE models. BadDepth overcomes this limitation by selectively manipulating the target object's depth using an image segmentation model and restoring the surrounding areas via depth completion, thereby generating poisoned datasets for object-level backdoor attacks. To improve robustness in physical-world scenarios, we further introduce digital-to-physical augmentation to bridge the domain gap between the physical world and the digital domain. Extensive experiments on multiple models validate the effectiveness of BadDepth in both the digital domain and the physical world, without being affected by environmental factors.
https://arxiv.org/abs/2505.16154
Leaf diseases are harmful conditions that affect the health, appearance, and productivity of plants, leading to significant plant loss and negatively impacting farmers' livelihoods. These diseases cause visible symptoms such as lesions, color changes, and texture variations, yet managing plant health remains difficult for farmers, especially on large or remote farms where expert knowledge is limited. The main motivation of this study is to provide an efficient and accessible solution for identifying plant leaf diseases in Bangladesh, where agriculture plays a critical role in food security. The objective of our research is to classify 21 distinct leaf diseases across six plants using deep learning models, improving disease detection accuracy while reducing the need for expert involvement. Deep Learning (DL) techniques, including a CNN and Transfer Learning (TL) models such as VGG16, VGG19, MobileNetV2, InceptionV3, ResNet50V2, and Xception, are used. VGG19 and Xception achieve the highest accuracies, with 98.90% and 98.66% respectively. Additionally, Explainable AI (XAI) techniques such as GradCAM, GradCAM++, LayerCAM, ScoreCAM, and FasterScoreCAM are used to enhance transparency by highlighting the regions the models focus on during disease classification. This transparency helps farmers understand the model's predictions and take necessary action. This approach not only improves disease management but also supports farmers in making informed decisions, leading to better plant protection and increased agricultural productivity.
https://arxiv.org/abs/2505.16033
Advanced diagnostic instruments are crucial for the accurate detection and treatment of lung diseases, which affect millions of individuals globally. This study examines the effectiveness of deep learning and transfer learning models using a hybrid dataset, created by merging four individual datasets from Bangladesh and global sources. The hybrid dataset significantly enhances model accuracy and generalizability, particularly in detecting COVID-19, pneumonia, lung opacity, and normal lung conditions from chest X-ray images. A range of models, including CNN, VGG16, VGG19, InceptionV3, Xception, ResNet50V2, InceptionResNetV2, MobileNetV2, and DenseNet121, were applied to both individual and hybrid datasets. The results showed superior performance on the hybrid dataset, with VGG16, Xception, ResNet50V2, and DenseNet121 each achieving an accuracy of 99%. This consistent performance across the hybrid dataset highlights the robustness of these models in handling diverse data while maintaining high accuracy. To understand the models' implicit behavior, explainable AI techniques were employed to illuminate their black-box nature. Specifically, LIME was used to enhance the interpretability of model predictions, especially in cases of misclassification, contributing to the development of reliable and interpretable AI-driven solutions for medical imaging.
https://arxiv.org/abs/2505.16028
This position paper argues that the image processing community should broaden its focus from purely model-centric development to include agentic system design as an essential complementary paradigm. While deep learning has significantly advanced capabilities for specific image processing tasks, current approaches face critical limitations in generalization, adaptability, and real-world problem-solving flexibility. We propose that developing intelligent agentic systems, capable of dynamically selecting, combining, and optimizing existing image processing tools, represents the next evolutionary step for the field. Such systems would emulate human experts' ability to strategically orchestrate different tools to solve complex problems, overcoming the brittleness of monolithic models. The paper analyzes key limitations of model-centric paradigms, establishes design principles for agentic image processing systems, and outlines different capability levels for such agents.
https://arxiv.org/abs/2505.16007
Exploring the trustworthiness of deep learning models is crucial, especially in critical domains such as medical imaging decision support systems. Conformal prediction has emerged as a rigorous means of providing deep learning models with reliable uncertainty estimates and safety guarantees. However, conformal prediction results degrade when the backbone model struggles in domain-shifted scenarios, such as variations across data sources. To address this challenge, this paper proposes a novel framework termed Conformal Ensemble of Vision Transformers (CE-ViTs), designed to enhance image classification performance by prioritizing domain adaptation and model robustness while accounting for uncertainty. The proposed method uses an ensemble of vision transformer models as the backbone, trained on diverse datasets including HAM10000, Dermofit, and Skin Cancer ISIC. This ensemble learning approach, calibrated on the combination of these datasets, aims to enhance domain adaptation through conformal learning. Experimental results show that the framework achieves a high coverage rate of 90.38%, an improvement of 9.95% compared to the HAM10000 model. This indicates a stronger likelihood that the prediction set includes the true label than with single models. Ensemble learning in CE-ViTs significantly improves conformal prediction performance, increasing the average prediction set size for challenging misclassified samples from 1.86 to 3.075.
https://arxiv.org/abs/2505.15997
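The coverage rate and average prediction set size quoted in the CE-ViTs abstract above are the two standard metrics of conformal prediction. The following is a minimal sketch of split conformal prediction applied to an averaged ensemble, not the authors' implementation. It rests on several assumptions that are not in the abstract: synthetic softmax outputs stand in for the vision transformers, ensemble members are combined by averaging their probabilities, the nonconformity score is `1 - p_true`, and the target miscoverage `alpha` is 0.1. All function names are hypothetical.

```python
import numpy as np


def synthetic_ensemble_probs(rng, n=2000, n_models=3, n_classes=7, signal=2.0):
    """Hypothetical stand-in for an ensemble: averaged softmax of noisy logits."""
    labels = rng.integers(0, n_classes, size=n)
    onehot = np.eye(n_classes)[labels]
    # Each "model" sees the true class plus independent Gaussian noise.
    logits = rng.normal(size=(n_models, n, n_classes)) + signal * onehot
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Assumption: ensemble members are combined by simple averaging.
    return probs.mean(axis=0), labels


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold using the score 1 - p(true class)."""
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        return np.inf  # too few calibration points: include every class
    return scores[k - 1]


def prediction_sets(probs, qhat):
    """Boolean mask; class k is in the set when 1 - p_k <= qhat."""
    return (1.0 - probs) <= qhat


def coverage_and_size(sets, labels):
    """Fraction of sets containing the true label, and mean set size."""
    covered = sets[np.arange(len(labels)), labels]
    return covered.mean(), sets.sum(axis=1).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs, labels = synthetic_ensemble_probs(rng)
    # First half calibrates the threshold, second half evaluates it.
    qhat = conformal_threshold(probs[:1000], labels[:1000], alpha=0.1)
    sets = prediction_sets(probs[1000:], qhat)
    coverage, avg_size = coverage_and_size(sets, labels[1000:])
    print(f"coverage={coverage:.3f}, average set size={avg_size:.2f}")
```

When calibration and test data are exchangeable, the expected coverage is at least `1 - alpha`. Under domain shift that guarantee no longer holds, which is the situation the abstract targets by calibrating on the combined datasets.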
This study presents a deep-learning framework for controlling multichannel acoustic feedback in audio devices. Traditional digital signal processing methods struggle to converge when dealing with highly correlated noise such as feedback. We introduce a Convolutional Recurrent Network that efficiently combines spatial and temporal processing, significantly improving speech enhancement with lower computational demands. Our approach utilizes three training methods: In-a-Loop Training, Teacher Forcing, and a Hybrid strategy with a Multichannel Wiener Filter, optimizing performance in complex acoustic environments. This scalable framework offers a robust solution for real-world applications, making significant advances in Acoustic Feedback Control technology.
https://arxiv.org/abs/2505.15914