Understanding the complex myocardial architecture is critical for diagnosing and treating heart disease. However, existing methods often struggle to accurately capture this intricate structure from Diffusion Tensor Imaging (DTI) data, particularly due to the lack of ground truth labels and the ambiguous, intertwined nature of fiber trajectories. We present a novel deep learning framework for unsupervised clustering of myocardial fibers, providing a data-driven approach to identifying distinct fiber bundles. We uniquely combine a Bidirectional Long Short-Term Memory network, which captures local sequential information along fibers, with a Transformer autoencoder that learns global shape features, incorporating essential anatomical context pointwise. Clustering these representations using a density-based algorithm identifies 33 to 62 robust clusters, successfully capturing the subtle distinctions in fiber trajectories at varying levels of granularity. Our framework offers a new, flexible, and quantitative way to analyze myocardial structure, achieving a level of delineation that, to our knowledge, has not been previously achieved, with potential applications in improving surgical planning, characterizing disease-related remodeling, and ultimately advancing personalized cardiac care.
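The abstract does not name its density-based algorithm; as a rough illustration of the final clustering stage, here is a minimal DBSCAN-style sketch in pure Python, assuming fibers have already been mapped to low-dimensional embedding vectors (the point coordinates, `eps`, and `min_pts` below are illustrative, not the paper's values):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: group embedded fibers that lie in dense regions;
    label -1 marks noise."""
    n = len(points)

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:             # noise reachable from a core point
                labels[j] = cluster         # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:     # j is itself a core point: expand
                queue.extend(j_seeds)
    return labels

# Two dense groups of toy 2-D "fiber embeddings" plus one outlier.
pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
       (10.0, 10.0)]
labels = dbscan(pts, eps=0.6, min_pts=2)
```

Unlike k-means, the number of clusters is not fixed in advance, which matches the paper's varying 33-to-62 cluster counts at different granularities.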
https://arxiv.org/abs/2504.01953
Remote photoplethysmography (rPPG), enabling non-contact physiological monitoring through facial light reflection analysis, faces critical computational bottlenecks as deep learning introduces performance gains at the cost of prohibitive resource demands. This paper proposes ME-rPPG, a memory-efficient algorithm built on temporal-spatial state space duality, which resolves the trilemma of model scalability, cross-dataset generalization, and real-time constraints. Leveraging a transferable state space, ME-rPPG efficiently captures subtle periodic variations across facial frames while maintaining minimal computational overhead, enabling training on extended video sequences and supporting low-latency inference. Achieving cross-dataset MAEs of 5.38 (MMPD), 0.70 (VitalVideo), and 0.25 (PURE), ME-rPPG outperforms all baselines with improvements ranging from 21.3% to 60.2%. Our solution enables real-time inference with only 3.6 MB memory usage and 9.46 ms latency, surpassing existing methods with 19.5%-49.7% higher accuracy and a 43.2% gain in user satisfaction in real-world deployments. The code and demos are released for reproducibility at this https URL.
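As background, the core of any state-space layer is a linear recurrence carried across frames with constant memory; the toy scan below (scalar state, made-up coefficients, not the actual ME-rPPG parameterisation) illustrates why such models can process long videos with a small footprint:

```python
def ssm_scan(x, a=0.9, b=0.5, c=1.0):
    """Linear state-space recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    Only the scalar state h is carried between frames, so memory stays O(1)
    no matter how long the input sequence is."""
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt
        ys.append(c * h)
    return ys

# An impulse at t=0 decays geometrically through the state.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
```

In real SSM layers the recurrence is vector-valued and the coefficients are learned, but the O(1)-state property shown here is exactly what enables training on extended sequences with low latency.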
https://arxiv.org/abs/2504.01774
Supervised deep learning for semantic segmentation has achieved excellent results in accurately identifying anatomical and pathological structures in medical images. However, it often requires large annotated training datasets, which limits its scalability in clinical settings. To address this challenge, semi-supervised learning is a well-established approach that leverages both labeled and unlabeled data. In this paper, we introduce a novel semi-supervised teacher-student framework for biomedical image segmentation, inspired by the recent success of generative models. Our approach leverages denoising diffusion probabilistic models (DDPMs) to generate segmentation masks by progressively refining noisy inputs conditioned on the corresponding images. The teacher model is first trained in an unsupervised manner using a cycle-consistency constraint based on noise-corrupted image reconstruction, enabling it to generate informative semantic masks. Subsequently, the teacher is integrated into a co-training process with a twin-student network. The student learns from ground-truth labels when available and from teacher-generated pseudo-labels otherwise, while the teacher continuously improves its pseudo-labeling capabilities. Finally, to further enhance performance, we introduce a multi-round pseudo-label generation strategy that iteratively improves the pseudo-labeling process. We evaluate our approach on multiple biomedical imaging benchmarks, spanning multiple imaging modalities and segmentation tasks. Experimental results show that our method consistently outperforms state-of-the-art semi-supervised techniques, highlighting its effectiveness in scenarios with limited annotated data. The code to replicate our experiments can be found at this https URL
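For readers unfamiliar with DDPMs, the forward process that the teacher learns to invert blends a clean mask with Gaussian noise according to a cumulative schedule; a minimal sketch of that forward step (the noise is passed in explicitly for determinism, and the schedule value is illustrative):

```python
import math

def forward_diffuse(x0, alpha_bar, eps):
    """DDPM forward process at one step:
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps,
    where alpha_bar in [0, 1] is the cumulative noise-schedule product."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

mask = [1.0, 0.0, 1.0]          # a tiny "segmentation mask"
noise = [0.5, -0.5, 0.25]
half_noised = forward_diffuse(mask, 0.5, noise)
```

At `alpha_bar = 1` the mask passes through untouched; at `alpha_bar = 0` only noise remains. The model in the paper is trained to run this process in reverse, conditioned on the input image.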
https://arxiv.org/abs/2504.01547
Accurate segmentation of polyps and skin lesions is essential for diagnosing colorectal and skin cancers. While various segmentation methods for polyps and skin lesions using fully supervised deep learning techniques have been developed, the pixel-level annotation of medical images by doctors is both time-consuming and costly. Foundational vision models like the Segment Anything Model (SAM) have demonstrated superior performance; however, directly applying SAM to medical segmentation may not yield satisfactory results due to the lack of domain-specific medical knowledge. In this paper, we propose BiSeg-SAM, a SAM-guided weakly supervised prompting and boundary refinement network for the segmentation of polyps and skin lesions. Specifically, we fine-tune SAM combined with a CNN module to learn local features. We introduce a WeakBox with two functions: automatically generating box prompts for the SAM model and using our proposed Multi-choice Mask-to-Box (MM2B) transformation for rough mask-to-box conversion, addressing the mismatch between coarse labels and precise predictions. Additionally, we apply scale consistency (SC) loss for prediction scale alignment. Our DetailRefine module enhances boundary precision and segmentation accuracy by refining coarse predictions using a limited amount of ground truth labels. This comprehensive approach enables BiSeg-SAM to achieve excellent multi-task segmentation performance. Our method demonstrates significant superiority over state-of-the-art (SOTA) methods when tested on five polyp datasets and one skin cancer dataset.
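The abstract leaves the MM2B details out; in its simplest form, a mask-to-box conversion just takes the extreme coordinates of the foreground pixels. A rough sketch of that idea (not the paper's multi-choice variant):

```python
def mask_to_box(mask):
    """Convert a binary mask (list of rows) into a bounding box
    (x_min, y_min, x_max, y_max), or None for an empty mask."""
    coords = [(x, y)
              for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    return (min(xs), min(ys), max(xs), max(ys))

# A tiny 3x4 mask with a small foreground blob.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
box = mask_to_box(mask)
```

Boxes derived this way from coarse masks can then serve as prompts for SAM, which is the role WeakBox plays in the paper's pipeline.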
https://arxiv.org/abs/2504.01452
This paper introduces the Deep Learning-based Nonlinear Model Predictive Controller with Scene Dynamics (DL-NMPC-SD) method for autonomous navigation. DL-NMPC-SD uses an a-priori nominal vehicle model in combination with a scene dynamics model learned from temporal range sensing information. The scene dynamics model is responsible for estimating the desired vehicle trajectory, as well as for adjusting the true system model used by the underlying model predictive controller. We propose to encode the scene dynamics model within the layers of a deep neural network, which acts as a nonlinear approximator for the high-order state-space of the operating conditions. The model is learned based on temporal sequences of range sensing observations and system states, both integrated by an Augmented Memory component. We use Inverse Reinforcement Learning and the Bellman optimality principle to train our learning controller with a modified version of the Deep Q-Learning algorithm, enabling us to estimate the desired state trajectory as an optimal action-value function. We have evaluated DL-NMPC-SD against the baseline Dynamic Window Approach (DWA), as well as against state-of-the-art End2End and reinforcement learning methods. The performance has been measured in three experiments: i) in our GridSim virtual environment, ii) on indoor and outdoor navigation tasks using our RovisLab AMTU (Autonomous Mobile Test Unit) platform, and iii) on a full-scale autonomous test vehicle driving on public roads.
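The paper's modified Deep Q-Learning variant is not spelled out in the abstract, but it builds on the standard Bellman optimality update; a tabular sketch of that underlying rule (state/action names and hyperparameters are placeholders):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step of the Bellman optimality update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q

# Two updates on the same transition pull Q(s0, left) toward the target.
Q = q_update({}, "s0", "left", 1.0, "s1", ["left", "right"])
Q = q_update(Q, "s0", "left", 1.0, "s1", ["left", "right"])
```

In DL-NMPC-SD the table is replaced by a deep network and the action-value function encodes the desired state trajectory, but the fixed-point equation being approximated is the same.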
https://arxiv.org/abs/2504.01336
Recent advancements in visual odometry systems have improved autonomous navigation; however, challenges persist in complex environments like forests, where dense foliage, variable lighting, and repetitive textures compromise feature correspondence accuracy. To address these challenges, we introduce ForestGlue, enhancing the SuperPoint feature detector through four configurations (grayscale, RGB, RGB-D, and stereo-vision) optimised for various sensing modalities. For feature matching, we employ LightGlue or SuperGlue, retrained with synthetic forest data. ForestGlue achieves comparable pose estimation accuracy to baseline models but requires only 512 keypoints (25% of the baseline's 2048) to reach an LO-RANSAC AUC score of 0.745 at a 10° threshold. With only a quarter of the keypoints needed, ForestGlue significantly reduces computational overhead, demonstrating effectiveness in dynamic forest environments, and making it suitable for real-time deployment on resource-constrained platforms. By combining ForestGlue with a transformer-based pose estimation model, we propose ForestVO, which estimates relative camera poses using matched 2D pixel coordinates between frames. On challenging TartanAir forest sequences, ForestVO achieves an average relative pose error (RPE) of 1.09 m and a kitti_score of 2.33%, outperforming direct-based methods like DSO by 40% in dynamic scenes. Despite using only 10% of the dataset for training, ForestVO maintains competitive performance with TartanVO while being a significantly lighter model. This work establishes an end-to-end deep learning pipeline specifically tailored for visual odometry in forested environments, leveraging forest-specific training data to optimise feature correspondence and pose estimation, thereby enhancing the accuracy and robustness of autonomous navigation systems.
https://arxiv.org/abs/2504.01261
Underwater images suffer from severe degradations, including color distortions, reduced visibility, and loss of structural details due to wavelength-dependent attenuation and scattering. Existing enhancement methods primarily focus on spatial-domain processing, neglecting the frequency domain's potential to capture global color distributions and long-range dependencies. To address these limitations, we propose FUSION, a dual-domain deep learning framework that jointly leverages spatial and frequency domain information. FUSION independently processes each RGB channel through multi-scale convolutional kernels and adaptive attention mechanisms in the spatial domain, while simultaneously extracting global structural information via FFT-based frequency attention. A Frequency Guided Fusion module integrates complementary features from both domains, followed by inter-channel fusion and adaptive channel recalibration to ensure balanced color distributions. Extensive experiments on benchmark datasets (UIEB, EUVP, SUIM-E) demonstrate that FUSION achieves state-of-the-art performance, consistently outperforming existing methods in reconstruction fidelity (highest PSNR of 23.717 dB and SSIM of 0.883 on UIEB), perceptual quality (lowest LPIPS of 0.112 on UIEB), and visual enhancement metrics (best UIQM of 3.414 on UIEB), while requiring significantly fewer parameters (0.28M) and lower computational complexity, demonstrating its suitability for real-time underwater imaging applications.
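To make the frequency-attention idea concrete, here is a toy sketch: a naive DFT (standing in for the FFT) followed by a softmax over bin magnitudes, so frequencies that dominate the global spectrum get the largest weights. This is only an illustration of the mechanism, not FUSION's actual module:

```python
import cmath, math

def dft_magnitudes(signal):
    """Naive O(n^2) DFT: magnitude of each frequency bin (FFT stand-in)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def frequency_attention(signal):
    """Toy frequency attention: a softmax over bin magnitudes, so bins
    dominating the global spectrum receive the largest weights."""
    mags = dft_magnitudes(signal)
    top = max(mags)
    exps = [math.exp(m - top) for m in mags]
    z = sum(exps)
    return [e / z for e in exps]

# A pure cosine at bin 1: attention concentrates on bins 1 and n-1.
sig = [math.cos(2 * math.pi * t / 8) for t in range(8)]
weights = frequency_attention(sig)
```

In the paper these spectral weights modulate learned feature maps rather than the raw signal, giving the network a global view that spatial convolutions alone lack.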
https://arxiv.org/abs/2504.01243
Deep learning models have achieved remarkable success in computer vision but remain vulnerable to adversarial attacks, particularly in black-box settings where model details are unknown. Existing adversarial attack methods (even those that operate on key frames) often treat video data as simple vectors, ignoring their inherent multi-dimensional structure, and require a large number of queries, making them inefficient and detectable. In this paper, we propose \textbf{TenAd}, a novel tensor-based low-rank adversarial attack that leverages the multi-dimensional properties of video data by representing videos as fourth-order tensors. By exploiting a low-rank attack structure, our method significantly reduces the search space and the number of queries needed to generate adversarial examples in black-box settings. Experimental results on standard video classification datasets demonstrate that \textbf{TenAd} effectively generates imperceptible adversarial perturbations while achieving higher attack success rates and query efficiency compared to state-of-the-art methods. Our approach outperforms existing black-box adversarial attacks in terms of success rate, query efficiency, and perturbation imperceptibility, highlighting the potential of tensor-based methods for adversarial attacks on video models.
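The search-space reduction is easy to see with a rank-1 sketch: a fourth-order perturbation tensor built from four factor vectors needs only T+H+W+C numbers instead of T·H·W·C. The dimensions and factor values below are illustrative, not the paper's:

```python
def rank1_perturbation(a, b, c, d, scale=1.0):
    """Rank-1 fourth-order tensor delta[i][j][k][l] = scale * a_i*b_j*c_k*d_l,
    parameterised by four factor vectors instead of a full dense tensor."""
    return [[[[scale * ai * bj * ck * dl for dl in d] for ck in c]
             for bj in b] for ai in a]

# A toy "video" of 4 frames, 3x3 pixels, 2 channels.
T, H, W, C = 4, 3, 3, 2
factors = ([1.0] * T, [0.5] * H, [0.5] * W, [1.0, -1.0])
delta = rank1_perturbation(*factors)

dense_params = T * H * W * C        # entries in the full perturbation tensor
lowrank_params = T + H + W + C      # numbers that parameterise the attack
```

A black-box attacker only has to search over the factor vectors, which is why query counts drop sharply compared with optimising every tensor entry.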
https://arxiv.org/abs/2504.01228
Remote photoplethysmography (rPPG) offers a novel approach to noninvasive monitoring of vital signs, such as respiratory rate, utilizing a camera. Although several supervised and self-supervised methods have been proposed, they often fail to accurately reconstruct the PPG signal, particularly in distinguishing between systolic and diastolic components. Their primary focus tends to be solely on extracting heart rate, which may not accurately represent the complete PPG signal. To address this limitation, this paper proposes a novel deep learning architecture using Generative Adversarial Networks, introducing multiple discriminators to extract rPPG signals from facial videos. These discriminators focus on the time domain, the frequency domain, and the second derivative of the original time-domain signal. The discriminator integrates four loss functions: a variance loss to mitigate local minima caused by noise; a dynamic time warping loss to address local minima induced by alignment and variable-length sequences; a sparsity loss for heart rate adjustment; and a variance loss to ensure a uniform distribution across the desired frequency domain and the time interval between the systolic and diastolic phases of the PPG signal.
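Of the four losses, dynamic time warping is the least standard in deep learning pipelines; a self-contained sketch of the classic DTW distance it is built on (used here as an alignment-tolerant comparison between waveforms, not the paper's differentiable variant):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences: the minimal
    cumulative |a_i - b_j| cost over monotone alignments of the two signals."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because DTW allows local stretching, a slightly shifted or variable-length PPG waveform can still score zero distance against the reference, which is why it helps the GAN escape alignment-induced local minima.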
https://arxiv.org/abs/2504.01220
Deep learning models have achieved significant success in various image-related tasks. However, they often encounter challenges related to computational complexity and overfitting. In this paper, we propose an efficient approach that leverages polygonal representations of images using dominant points or contour coordinates. By transforming input images into these compact forms, our method significantly reduces computational requirements, accelerates training, and conserves resources, making it suitable for real-time and resource-constrained applications. These representations inherently capture essential image features while filtering noise, providing a natural regularization effect that mitigates overfitting. The resulting lightweight models achieve performance comparable to state-of-the-art methods using full-resolution images while enabling deployment on edge devices. Extensive experiments on benchmark datasets validate the effectiveness of our approach in reducing complexity, improving generalization, and facilitating edge computing applications. This work demonstrates the potential of polygonal representations in advancing efficient and scalable deep learning solutions for real-world scenarios. The code for the experiments of the paper is provided at this https URL.
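A standard way to reduce a contour to its dominant points is Ramer-Douglas-Peucker simplification; a pure-Python sketch of that classic algorithm (the abstract does not say which dominant-point method the paper uses, so this is one plausible choice):

```python
def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dy * px - dx * py + bx * ay - by * ax) / (dx * dx + dy * dy) ** 0.5

def douglas_peucker(points, tol):
    """Simplify a polyline to its dominant points within tolerance tol."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the chord joining the endpoints.
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= tol:
        return [points[0], points[-1]]
    left = douglas_peucker(points[: idx + 1], tol)
    right = douglas_peucker(points[idx:], tol)
    return left[:-1] + right

simplified = douglas_peucker([(0, 0), (1, 0), (2, 1), (3, 0), (4, 0)], 0.5)
```

The retained vertices form the compact polygonal input the paper feeds to its lightweight models; raising `tol` trades detail for an even smaller representation.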
https://arxiv.org/abs/2504.01214
Hardware limitations and satellite launch costs make direct acquisition of high temporal-spatial resolution remote sensing imagery challenging. Remote sensing spatiotemporal fusion (STF) technology addresses this problem by merging high temporal but low spatial resolution imagery with high spatial but low temporal resolution imagery to efficiently generate high spatiotemporal resolution satellite images. STF provides unprecedented observational capabilities for land surface change monitoring, agricultural management, and environmental research. Deep learning (DL) methods have revolutionized the remote sensing spatiotemporal fusion field over the past decade through powerful automatic feature extraction and nonlinear modeling capabilities, significantly outperforming traditional methods in handling complex spatiotemporal data. Despite the rapid development of DL-based remote sensing STF, the community lacks a systematic review of this quickly evolving field. This paper comprehensively reviews DL developments in remote sensing STF over the last decade, analyzing key research trends, method classifications, commonly used datasets, and evaluation metrics. It discusses major challenges in existing research and identifies promising future research directions, offering researchers in this field a reference and inspiration for new ideas. The specific models, datasets, and other information mentioned in this article have been collected at this https URL.
https://arxiv.org/abs/2504.00901
Spiking neural networks (SNNs) present a promising computing paradigm for neuromorphic processing of event-based sensor data. The resonate-and-fire (RF) neuron, in particular, appeals through its biological plausibility, complex dynamics, yet computational simplicity. Despite theoretically predicted benefits, challenges in parameter initialization and efficient learning have inhibited the implementation of RF networks, constraining their use to a single layer. In this paper, we address these shortcomings by deriving the RF neuron as a structured state space model (SSM) from the HiPPO framework. We introduce S5-RF, a new SSM layer composed of RF neurons based on the S5 model, which features a generic initialization scheme and fast training within a deep architecture. S5-RF scales an RF network, for the first time, to a deep SNN with up to four layers and achieves a new state-of-the-art result of 78.8% for recurrent SNNs on the Spiking Speech Commands dataset in under three hours of training time. Moreover, compared to the reference SNNs that solve our benchmarking tasks, it achieves similar performance with far fewer spiking operations. Our code is publicly available at this https URL.
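The RF neuron itself is a damped complex oscillator with a threshold; a minimal discrete-time sketch (the frequency, decay, step size, and threshold below are illustrative, and this omits the paper's SSM parameterisation and reset details):

```python
import cmath

def resonate_and_fire(inputs, freq=1.0, decay=0.2, dt=0.1, threshold=0.8):
    """Resonate-and-fire neuron: a damped complex oscillator
    z_t = exp((-decay + i*2*pi*freq) * dt) * z_{t-1} + x_t,
    emitting a spike whenever the real part exceeds the threshold."""
    a = cmath.exp(complex(-decay, 2 * cmath.pi * freq) * dt)
    z, spikes = 0j, []
    for x in inputs:
        z = a * z + x
        spikes.append(1 if z.real > threshold else 0)
    return spikes

# A single input impulse: the neuron spikes immediately, then again one
# full oscillation period later as the damped state swings back into phase.
spikes = resonate_and_fire([1.0] + [0.0] * 11)
```

The decay-and-rotate factor `a` is exactly a diagonal SSM transition with a complex eigenvalue, which is the observation that lets the paper initialize RF networks from the HiPPO/S5 machinery.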
https://arxiv.org/abs/2504.00719
Accurate measurement of eyelid parameters such as Margin Reflex Distances (MRD1, MRD2) and Levator Function (LF) is critical in oculoplastic diagnostics but remains limited by manual, inconsistent methods. This study evaluates deep learning models (SE-ResNet, EfficientNet, and the vision-transformer-based DINOv2) for automating these measurements using smartphone-acquired images. We assess performance across frozen and fine-tuned settings using MSE, MAE, and R² metrics. DINOv2, pretrained through self-supervised learning, demonstrates superior scalability and robustness, especially under the frozen conditions ideal for mobile deployment. Lightweight regressors such as MLP and Deep Ensemble offer high precision with minimal computational overhead. To address class imbalance and improve generalization, we integrate focal loss, orthogonal regularization, and binary encoding strategies. Our results show that DINOv2 combined with these enhancements delivers consistent, accurate predictions across all tasks, making it a strong candidate for real-world, mobile-friendly clinical applications. This work highlights the potential of foundation models in advancing AI-powered ophthalmic care.
https://arxiv.org/abs/2504.00515
Immunohistochemical (IHC) staining serves as a valuable technique for detecting specific antigens or proteins through antibody-mediated visualization. However, the IHC staining process is both time-consuming and costly. To address these limitations, the application of deep learning models for direct translation of cost-effective Hematoxylin and Eosin (H&E) stained images into IHC stained images has emerged as an efficient solution. Nevertheless, the conversion from H&E to IHC images presents significant challenges, primarily due to alignment discrepancies between image pairs and the inherent diversity in IHC staining style patterns. To overcome these challenges, we propose the Style Distribution Constraint Feature Alignment Network (SCFANet), which incorporates two innovative modules: the Style Distribution Constrainer (SDC) and Feature Alignment Learning (FAL). The SDC ensures consistency between the generated and target images' style distributions while integrating cycle consistency loss to maintain structural consistency. To mitigate the complexity of direct image-to-image translation, the FAL module decomposes the end-to-end translation task into two subtasks: image reconstruction and feature alignment. Furthermore, we ensure pathological consistency between generated and target images by maintaining pathological pattern consistency and Optical Density (OD) uniformity. Extensive experiments conducted on the Breast Cancer Immunohistochemical (BCI) dataset demonstrate that our SCFANet model outperforms existing methods, achieving precise transformation of H&E-stained images into their IHC-stained counterparts. The proposed approach not only addresses the technical challenges in H&E to IHC image translation but also provides a robust framework for accurate and efficient stain conversion in pathological analysis.
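Optical density, used above as a pathological-consistency constraint, comes from the Beer-Lambert relation; a one-function sketch (the 255 reference intensity and the clamping floor are illustrative conventions for 8-bit images, not values stated in the paper):

```python
import math

def optical_density(intensity, i0=255.0):
    """Beer-Lambert optical density: OD = -log10(I / I0).
    A small floor avoids log(0) for fully absorbed pixels."""
    return -math.log10(max(intensity, 1e-6) / i0)
```

Matching OD statistics between a generated IHC image and its target compares stain absorbance rather than raw pixel values, which is why it serves as a pathology-aware consistency check.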
https://arxiv.org/abs/2504.00490
The proliferation of wearable technology has established multi-device ecosystems comprising smartphones, smartwatches, and headphones as critical enablers for ubiquitous pedestrian localization. However, traditional pedestrian dead reckoning (PDR) struggles with diverse motion modes, while data-driven methods, despite improving accuracy, often lack robustness due to their reliance on a single-device setup. Therefore, a promising solution is to fully leverage existing wearable devices to form a flexiwear bodynet for robust and accurate pedestrian localization. This paper presents Suite-IN++, a deep learning framework for flexiwear bodynet-based pedestrian localization. Suite-IN++ integrates motion data from wearable devices on different body parts, using contrastive learning to separate global and local motion features. It fuses global features based on the data reliability of each device to capture overall motion trends and employs an attention mechanism to uncover cross-device correlations in local features, extracting motion details helpful for accurate localization. To evaluate our method, we construct a real-life flexiwear bodynet dataset, incorporating Apple Suite (iPhone, Apple Watch, and AirPods) across diverse walking modes and device configurations. Experimental results demonstrate that Suite-IN++ achieves superior localization accuracy and robustness, significantly outperforming state-of-the-art models in real-life pedestrian tracking scenarios.
https://arxiv.org/abs/2504.00438
Adversarial attacks pose a critical security threat to real-world AI systems by injecting human-imperceptible perturbations into benign samples to induce misclassification in deep learning models. While existing detection methods, such as Bayesian uncertainty estimation and activation pattern analysis, have achieved progress through feature engineering, their reliance on handcrafted feature design and prior knowledge of attack patterns limits generalization capabilities and incurs high engineering costs. To address these limitations, this paper proposes a lightweight adversarial detection framework based on the large-scale pre-trained vision-language model CLIP. Departing from conventional adversarial feature characterization paradigms, we innovatively adopt an anomaly detection perspective. By jointly fine-tuning CLIP's dual visual-text encoders with trainable adapter networks and learnable prompts, we construct a compact representation space tailored for natural images. Notably, our detection architecture achieves substantial improvements in generalization capability across both known and unknown attack patterns compared to traditional methods, while significantly reducing training overhead. This study provides a novel technical pathway for establishing a parameter-efficient and attack-agnostic defense paradigm, markedly enhancing the robustness of vision systems against evolving adversarial threats.
https://arxiv.org/abs/2504.00429
The task of 2D animal pose estimation plays a crucial role in advancing deep learning applications in animal behavior analysis and ecological research. Despite notable progress in some existing approaches, our study reveals that the scarcity of high-quality datasets remains a significant bottleneck, limiting the full potential of current methods. To address this challenge, we propose a novel Controllable Image Generation Pipeline for synthesizing animal pose estimation data, termed AP-CAP. Within this pipeline, we introduce a Multi-Modal Animal Image Generation Model capable of producing images with expected poses. To enhance the quality and diversity of the generated data, we further propose three innovative strategies: (1) Modality-Fusion-Based Animal Image Synthesis Strategy to integrate multi-source appearance representations, (2) Pose-Adjustment-Based Animal Image Synthesis Strategy to dynamically capture diverse pose variations, and (3) Caption-Enhancement-Based Animal Image Synthesis Strategy to enrich visual semantic understanding. Leveraging the proposed model and strategies, we create the MPCH Dataset (Modality-Pose-Caption Hybrid), the first hybrid dataset that innovatively combines synthetic and real data, establishing the largest-scale multi-source heterogeneous benchmark repository for animal pose estimation to date. Extensive experiments demonstrate the superiority of our method in improving both the performance and generalization capability of animal pose estimators.
https://arxiv.org/abs/2504.00394
Epilepsy is a common neurological disorder that affects around 65 million people worldwide. Detecting seizures quickly and accurately is vital, given the prevalence and severity of the associated complications. Recently, deep learning-based automated seizure detection methods have emerged as solutions; however, most existing methods require extensive post-processing and do not effectively handle the crucial long-range patterns in EEG data. In this work, we propose SeizureTransformer, a simple model comprising (i) a deep encoder built from 1D convolutions, (ii) a residual CNN stack and a transformer encoder that embed the encoder's output into a high-level representation with contextual information, and (iii) a streamlined decoder that converts these features into a sequence of probabilities, directly indicating the presence or absence of seizures at every time step. Extensive experiments on public and private EEG seizure detection datasets demonstrate that our model significantly outperforms existing approaches (ranked first in the 2025 "seizure detection challenge" organized at the International Conference on Artificial Intelligence in Epilepsy and Other Neurological Disorders), underscoring its potential for real-time, precise seizure detection.
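The distinguishing output format here is one probability per EEG time step, with no windowing or heavy post-processing. A minimal numpy sketch of such a per-timestep decoder head, with random features standing in for the encoder/transformer output (the weights and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def seizure_probabilities(features, w, b):
    """Sketch of a streamlined decoder: project each time step's
    high-level feature to a logit and squash it to a probability, so
    the model emits one seizure probability per EEG time step."""
    logits = features @ w + b          # (T, d) @ (d,) -> (T,)
    return sigmoid(logits)

T, d = 100, 32                          # 100 time steps, 32-dim features
features = rng.normal(size=(T, d))      # stand-in for encoder output
w, b = rng.normal(size=d), 0.0
probs = seizure_probabilities(features, w, b)
mask = probs > 0.5                      # binary per-step seizure decision
```

Because the output is already a per-step probability sequence, thresholding it directly yields seizure onsets/offsets without the extensive post-processing the abstract criticizes in prior methods.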
https://arxiv.org/abs/2504.00336
While convolutional neural networks (CNNs) and vision transformers (ViTs) have advanced medical image segmentation, they face inherent limitations such as local receptive fields in CNNs and high computational complexity in ViTs. This paper introduces Deconver, a novel network that integrates traditional deconvolution techniques from image restoration as a core learnable component within a U-shaped architecture. Deconver replaces computationally expensive attention mechanisms with efficient nonnegative deconvolution (NDC) operations, enabling the restoration of high-frequency details while suppressing artifacts. Key innovations include a backpropagation-friendly NDC layer based on a provably monotonic update rule and a parameter-efficient design. Evaluated across four datasets (ISLES'22, BraTS'23, GlaS, FIVES) covering both 2D and 3D segmentation tasks, Deconver achieves state-of-the-art performance in Dice scores and Hausdorff distance while reducing computational costs (FLOPs) by up to 90% compared to leading baselines. By bridging traditional image restoration with deep learning, this work offers a practical solution for high-precision segmentation in resource-constrained clinical workflows. The project is available at this https URL.
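The abstract's NDC layer is not spelled out here, but its two named properties, nonnegativity and a provably monotonic update rule, are exactly those of classic Richardson-Lucy deconvolution from image restoration, whose multiplicative update keeps the estimate nonnegative and monotonically improves the Poisson likelihood. A 1D numpy sketch of that classic update, as a stand-in for the spirit of an NDC operation (not the paper's layer):

```python
import numpy as np

def richardson_lucy_1d(y, psf, n_iter=200):
    """Nonnegative deconvolution via Richardson-Lucy multiplicative
    updates: each step multiplies the current estimate by a nonnegative
    correction factor, so the iterate stays >= 0 throughout."""
    psf = psf / psf.sum()               # normalized point-spread function
    psf_flip = psf[::-1]                # adjoint of the blur operator
    x = np.full_like(y, y.mean())       # flat nonnegative initialization
    for _ in range(n_iter):
        blur = np.convolve(x, psf, mode="same") + 1e-12
        ratio = y / blur                # observed / predicted
        x = x * np.convolve(ratio, psf_flip, mode="same")
    return x

# Toy example: blur a spike train, then restore the high-frequency detail.
truth = np.zeros(64)
truth[[16, 40]] = 1.0
psf = np.array([0.25, 0.5, 0.25])
observed = np.convolve(truth, psf, mode="same")
restored = richardson_lucy_1d(observed, psf)
```

Making such an update differentiable and backpropagation-friendly, as Deconver does, is what turns this classical restoration step into a learnable building block that can replace attention.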
https://arxiv.org/abs/2504.00302
The deployment of a continuous methane monitoring system requires determining the optimal number and placement of fixed sensors. However, planning is labor-intensive, requiring extensive site setup and iteration to meet client restrictions. This challenge is amplified when evaluating multiple sites, limiting scalability. To address this, we introduce SmartScan, an AI framework that automates data extraction for optimal sensor placement. SmartScan identifies subspaces of interest from satellite images using an interactive tool to create facility-specific constraint sets efficiently. SmartScan leverages the Segment Anything Model (SAM), a prompt-based transformer for zero-shot segmentation, enabling subspace extraction without explicit training. It operates in two modes: (1) Data Curation Mode, where satellite images are processed to extract high-quality subspaces using an interactive prompting system for SAM, and (2) Autonomous Mode, where user-curated prompts train a deep learning network to replace manual prompting, fully automating subspace extraction. The interactive tool also serves for quality control, allowing users to refine AI-generated outputs and generate additional constraint sets as needed. With its AI-driven prompting mechanism, SmartScan delivers high-throughput, high-quality subspace extraction with minimal human intervention, enhancing scalability and efficiency. Notably, its adaptable design makes it suitable for extracting regions of interest from ultra-high-resolution satellite imagery across various domains.
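The two modes share one extraction path and differ only in who supplies the prompts: a human in Data Curation Mode, a learned network in Autonomous Mode. A hypothetical sketch of that dispatch (the `segment` placeholder stands in for SAM's prompt-based prediction; none of these names are SmartScan's API):

```python
def segment(image, prompts):
    """Placeholder for SAM's prompt-based zero-shot segmentation;
    here it just records which point prompts produced the mask."""
    return {"mask": f"mask@{len(prompts)}pts", "prompts": prompts}

def curation_mode(image, ask_user):
    # Data Curation Mode: a human supplies point prompts interactively.
    prompts = ask_user(image)
    return segment(image, prompts)

def autonomous_mode(image, prompt_predictor):
    # Autonomous Mode: a trained network replaces the human prompter,
    # fully automating subspace extraction.
    prompts = prompt_predictor(image)
    return segment(image, prompts)

# Usage: the same extraction code path serves both modes.
human = lambda img: [(12, 34), (56, 78)]           # curated clicks
model = lambda img: [(10, 30), (55, 80), (90, 5)]  # learned prompts
curated = curation_mode("site.png", human)
auto = autonomous_mode("site.png", model)
```

The design choice this illustrates is that curated prompts double as training data: every human session in the first mode supplies (image, prompts) pairs for the predictor that drives the second.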
https://arxiv.org/abs/2504.00200