Segmenting brain tumors in multi-parametric magnetic resonance imaging enables quantitative analysis in support of clinical trials and personalized patient care. This analysis has the potential to impact clinical decision-making processes, including diagnosis and prognosis. In 2023, the well-established Brain Tumor Segmentation (BraTS) challenge presented a substantial expansion with eight tasks and 4,500 brain tumor cases. In this paper, we present a deep learning-based ensemble strategy that is evaluated for newly included tumor cases in three tasks: pediatric brain tumors (PED), intracranial meningioma (MEN), and brain metastases (MET). In particular, we ensemble outputs from state-of-the-art nnU-Net and Swin UNETR models on a region-wise basis. Furthermore, we implemented a targeted post-processing strategy based on a cross-validated threshold search to improve the segmentation results for tumor sub-regions. The evaluation of our proposed method on unseen test cases for the three tasks resulted in lesion-wise Dice scores for PED: 0.653, 0.809, 0.826; MEN: 0.876, 0.867, 0.849; and MET: 0.555, 0.600, 0.580; for the enhancing tumor, tumor core, and whole tumor, respectively. Our method was ranked first for PED, third for MEN, and fourth for MET.
https://arxiv.org/abs/2409.08232
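The region-wise ensembling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the region names ("WT", "TC", "ET"), the equal weights, and the fixed 0.5 threshold are assumptions; the paper instead tunes sub-region thresholds via a cross-validated search.

```python
import numpy as np

def ensemble_region_wise(prob_a, prob_b, weights=(0.5, 0.5), threshold=0.5):
    """Fuse two models' per-region probability maps and threshold them.

    prob_a, prob_b: dicts mapping a tumor sub-region name (e.g. "WT",
    "TC", "ET") to a probability volume of identical shape.
    """
    w_a, w_b = weights
    masks = {}
    for region in prob_a:
        fused = w_a * prob_a[region] + w_b * prob_b[region]
        masks[region] = (fused >= threshold).astype(np.uint8)
    return masks
```

In practice the fixed threshold would be replaced per sub-region by the value found in the cross-validated search.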
The global increase in observed forest dieback, characterised by the death of tree foliage, heralds widespread decline in forest ecosystems. This degradation causes significant changes to ecosystem services and functions, including habitat provision and carbon sequestration, which can be difficult to detect using traditional monitoring techniques, highlighting the need for large-scale and high-frequency monitoring. Contemporary developments in the instruments and methods to gather and process data at large scales mean this monitoring is now possible. In particular, the advancement of low-cost drone technology and deep learning on consumer-level hardware provide new opportunities. Here, we use an approach based on deep learning and vegetation indices to assess crown dieback from RGB aerial data without the need for expensive instrumentation such as LiDAR. We use an iterative approach to match crown footprints predicted by deep learning with field-based inventory data from a Mediterranean ecosystem exhibiting drought-induced dieback, and compare expert field-based crown dieback estimation with vegetation index-based estimates. We obtain high overall segmentation accuracy (mAP: 0.519) without the need for additional technical development of the underlying Mask R-CNN model, underscoring the potential of these approaches for non-expert use and demonstrating their applicability to real-world conservation. We also find colour-coordinate-based estimates of dieback correlate well with expert field-based estimation. Substituting Mask R-CNN model predictions for ground-truth crowns showed negligible impact on dieback estimates, indicating robustness. Our findings demonstrate the potential of automated data collection and processing, including the application of deep learning, to improve the coverage, speed and cost of forest dieback monitoring.
https://arxiv.org/abs/2409.08171
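The abstract's "colour-coordinate based estimates" of dieback can be illustrated with the Green Chromatic Coordinate (GCC = G / (R + G + B)), a standard RGB colour coordinate used in vegetation monitoring. The threshold below is a hypothetical placeholder, not a value from the paper:

```python
import numpy as np

def green_chromatic_coordinate(rgb):
    """GCC = G / (R + G + B) per pixel; rgb is a (..., 3) float array."""
    total = rgb.sum(axis=-1)
    return np.divide(rgb[..., 1], total,
                     out=np.zeros_like(total), where=total > 0)

def dieback_fraction(rgb_pixels, gcc_threshold=0.34):
    """Fraction of crown pixels whose GCC falls below a (hypothetical)
    'healthy foliage' threshold -- a simple proxy for crown dieback."""
    gcc = green_chromatic_coordinate(rgb_pixels)
    return float((gcc < gcc_threshold).mean())
```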
Segmentation is a crucial task in the medical imaging field and is often an important primary step or even a prerequisite to the analysis of medical volumes. Yet treatments such as surgery complicate the accurate delineation of regions of interest. The BraTS Post-Treatment 2024 Challenge published the first public dataset for post-surgery glioma segmentation and addresses the aforementioned issue by fostering the development of automated segmentation tools for glioma in MRI data. In this effort, we propose two straightforward approaches to enhance the segmentation performance of deep learning-based methodologies. First, we incorporate an additional input based on a simple linear combination of the available MRI sequences, which highlights enhancing tumors. Second, we employ various ensembling methods to weigh the contribution of a battery of models. Our results demonstrate that these approaches significantly improve segmentation performance compared to baseline models, underscoring the effectiveness of these simple approaches in improving medical image segmentation tasks.
https://arxiv.org/abs/2409.08143
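The extra input channel built from a "simple linear combination of the available MRI sequences" could look like the sketch below. The abstract does not give the coefficients, so the post-contrast minus pre-contrast T1 subtraction here (a combination that tends to highlight enhancing tumor) and the z-score normalization are assumptions:

```python
import numpy as np

def enhancing_contrast_channel(t1ce, t1, eps=1e-8):
    """Hypothetical extra input channel: z-scored difference of
    post-contrast (t1ce) and pre-contrast (t1) volumes."""
    diff = t1ce.astype(np.float64) - t1.astype(np.float64)
    return (diff - diff.mean()) / (diff.std() + eps)
```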
Efficient point cloud coding has become increasingly critical for multiple applications such as virtual reality, autonomous driving, and digital twin systems, where rich and interactive 3D data representations may functionally make the difference. Deep learning has emerged as a powerful tool in this domain, offering advanced techniques for compressing point clouds more efficiently than conventional coding methods while also enabling effective computer vision tasks to be performed directly in the compressed domain, thus, for the first time, making available a common compressed visual representation effective for both man and machine. Taking advantage of this potential, JPEG has recently finalized the JPEG Pleno Learning-based Point Cloud Coding (PCC) standard, offering efficient lossy coding of static point clouds and targeting both human visualization and machine processing by leveraging deep learning models for geometry and color coding. The geometry is processed directly in its original 3D form using sparse convolutional neural networks, while the color data is projected onto 2D images and encoded using the likewise learning-based JPEG AI standard. The goal of this paper is to provide a complete technical description of the JPEG PCC standard, along with a thorough benchmarking of its performance against the state-of-the-art, while highlighting its main strengths and weaknesses. In terms of compression performance, JPEG PCC outperforms the conventional MPEG PCC standards, especially in geometry coding, achieving significant rate reductions. Color compression performance is less competitive, but this is offset by the power of a fully learning-based coding framework for both geometry and color and the effective compressed-domain processing it enables.
https://arxiv.org/abs/2409.08130
3D segmentation is a core problem in computer vision and, similarly to many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds to employ fully-supervised training remains too labor intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises due to the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic $n$-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer. Our project page is available at this http URL.
https://arxiv.org/abs/2409.08102
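The uncertainty-based pseudo-label filtering step can be sketched as follows: several stochastic forward passes (e.g. with dropout active) are averaged, predictive entropy is computed point-wise, and only low-entropy points keep their pseudo-labels. The entropy threshold is an illustrative assumption, not the paper's setting:

```python
import numpy as np

def filter_pseudo_labels(mc_probs, entropy_threshold=0.5):
    """mc_probs: (T, N, C) softmax outputs from T stochastic passes over
    N points. Returns pseudo-labels and a boolean keep-mask for the
    low-uncertainty points."""
    mean_probs = mc_probs.mean(axis=0)                            # (N, C)
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)
    labels = mean_probs.argmax(axis=1)
    keep = entropy < entropy_threshold
    return labels, keep
```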
In the wake of a fabricated explosion image at the Pentagon, the ability to discern real images from fake counterparts has never been more critical. Our study introduces a novel multi-modal approach to detect AI-generated images amidst the proliferation of new generation methods such as Diffusion models. Our method, UGAD, encompasses three key detection steps: First, we transform the RGB images into YCbCr channels and apply an Integral Radial Operation to emphasize salient radial features. Second, the Spatial Fourier Extraction operation is used for a spatial shift, utilizing a pre-trained deep learning network for optimal feature extraction. Finally, the deep neural network classification stage processes the data through dense layers, using softmax for classification. Our approach significantly enhances the accuracy of differentiating between real and AI-generated images, as evidenced by a 12.64% increase in accuracy and a 28.43% increase in AUC compared to existing state-of-the-art methods.
https://arxiv.org/abs/2409.07913
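The first step, the RGB-to-YCbCr transform, is a standard color-space conversion; the sketch below uses the full-range BT.601 matrix (the JPEG convention), which is an assumption since the abstract does not specify the variant. The subsequent Integral Radial Operation and Spatial Fourier Extraction are method-specific and not reproduced here:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Full-range BT.601 RGB -> YCbCr conversion; rgb values in [0, 255]."""
    m = np.array([[ 0.299,     0.587,     0.114   ],
                  [-0.168736, -0.331264,  0.5     ],
                  [ 0.5,      -0.418688, -0.081312]])
    ycbcr = rgb @ m.T
    ycbcr[..., 1:] += 128.0   # center the chroma channels at 128
    return ycbcr
```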
During multimodal model training and inference, data samples may be missing certain modalities owing to sensor limitations, cost constraints, privacy concerns, data loss, and temporal and spatial factors, leading to compromised model performance. This survey provides an overview of recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning techniques. It is the first comprehensive survey that covers the historical background and the distinction between MLMM and standard multimodal learning setups, followed by a detailed analysis of current MLMM methods, applications, and datasets, concluding with a discussion about challenges and potential future directions in the field.
https://arxiv.org/abs/2409.07825
Medical image segmentation, a critical application of semantic segmentation in healthcare, has seen significant advancements through specialized computer vision techniques. While deep learning-based medical image segmentation is essential for assisting in medical diagnosis, the lack of diverse training data causes the long-tail problem. Moreover, most previous hybrid CNN-ViT architectures have limited ability to combine various attention mechanisms in different layers of the Convolutional Neural Network. To address these issues, we propose a Lagrange Duality Consistency (LDC) Loss, integrated with a Boundary-Aware Contrastive Loss, as the overall training objective for semi-supervised learning to mitigate the long-tail problem. Additionally, we introduce CMAformer, a novel network that synergizes the strengths of ResUNet and Transformer. The cross-attention block in CMAformer effectively integrates spatial attention and channel attention for multi-scale feature fusion. Overall, our results indicate that CMAformer, combined with the feature fusion framework and the new consistency loss, demonstrates strong complementarity in semi-supervised learning ensembles. We achieve state-of-the-art results on multiple public medical image datasets. Example code is available at: \url{this https URL}.
https://arxiv.org/abs/2409.07793
In real-world clinical settings, data distributions evolve over time, with a continuous influx of new, limited disease cases. Therefore, class incremental learning is of great significance, i.e., deep learning models are required to learn new class knowledge while maintaining accurate recognition of previous diseases. However, traditional deep neural networks often suffer from severe forgetting of prior knowledge when adapting to new data unless trained from scratch, which incurs undesirable time and computational costs. Additionally, the sample sizes for different diseases can be highly imbalanced, with newly emerging diseases typically having much fewer instances, consequently causing classification bias. To tackle these challenges, we are the first to propose a class-incremental learning method under limited samples in the biomedical field. First, we propose a novel cumulative entropy prediction module to measure the uncertainty of the samples, of which the most uncertain samples are stored in a memory bank as exemplars for the model's later review. Furthermore, we theoretically demonstrate its effectiveness in measuring uncertainty. Second, we develop a fine-grained semantic expansion module through various augmentations, leading to more compact distributions within the feature space and creating sufficient room for generalization to new classes. In addition, a cosine classifier is utilized to mitigate the classification bias caused by imbalanced datasets. Across four imbalanced data distributions over two datasets, our method achieves optimal performance, surpassing state-of-the-art methods by as much as 53.54% in accuracy.
https://arxiv.org/abs/2409.07757
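A cosine classifier, as mentioned above, replaces the usual dot-product logits with scaled cosine similarities between L2-normalized features and class weights, which removes the weight-norm bias that lets head classes dominate. The scale value below is a common choice, not the paper's reported setting:

```python
import numpy as np

def cosine_classifier_logits(features, weights, scale=16.0, eps=1e-12):
    """Logits as scaled cosine similarity between features (N, D) and
    class weight vectors (C, D), both L2-normalized."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    w = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + eps)
    return scale * f @ w.T
```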
Abnormal behavior detection, action recognition, and fight and violence detection in videos constitute an area that has attracted a lot of interest in recent years. In this work, we propose an architecture that combines a Bidirectional Gated Recurrent Unit (BiGRU) and a 2D Convolutional Neural Network (CNN) to detect violence in video sequences. A CNN is used to extract spatial characteristics from each frame, while the BiGRU extracts temporal and local motion characteristics using the CNN-extracted features from multiple frames. The proposed end-to-end deep learning network is tested on three public datasets with varying scene complexities and achieves accuracies of up to 98%. The obtained results are promising and demonstrate the effectiveness of the proposed end-to-end approach.
https://arxiv.org/abs/2409.07588
Violence and abnormal behavior detection research has attracted increasing interest in recent years, due mainly to a rise in crime in large cities worldwide. In this work, we propose a deep learning architecture for violence detection which combines recurrent neural networks (RNNs) and 2-dimensional convolutional neural networks (2D CNNs). In addition to video frames, we use optical flow computed from the captured sequences. The CNN extracts spatial characteristics in each frame, while the RNN extracts temporal characteristics. The use of optical flow allows the movements in the scenes to be encoded. The proposed approach reaches the same level as state-of-the-art techniques and sometimes surpasses them; it was validated on three databases, achieving good results.
https://arxiv.org/abs/2409.07581
Right Heart Catheterization is a gold standard procedure for diagnosing Pulmonary Hypertension by measuring mean Pulmonary Artery Pressure (mPAP). It is invasive, costly, time-consuming and carries risks. In this paper, for the first time, we explore the estimation of mPAP from videos of noninvasive Cardiac Magnetic Resonance Imaging. To enhance the predictive capabilities of Deep Learning models used for this task, we introduce an additional modality in the form of demographic features and clinical measurements. Inspired by all-Multilayer Perceptron architectures, we present TabMixer, a novel module enabling the integration of imaging and tabular data through spatial, temporal and channel mixing. Specifically, we present the first approach that utilizes Multilayer Perceptrons to interchange tabular information with imaging features in vision models. We test TabMixer for mPAP estimation and show that it enhances the performance of Convolutional Neural Networks, 3D-MLP and Vision Transformers while being competitive with previous modules for imaging and tabular data. Our approach has the potential to improve clinical processes involving both modalities, particularly in noninvasive mPAP estimation, thus significantly enhancing the quality of life for individuals affected by Pulmonary Hypertension. We provide source code for using TabMixer at this https URL.
https://arxiv.org/abs/2409.07564
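The "spatial, temporal and channel mixing" of imaging and tabular data can be sketched in the all-MLP (Mixer) style the abstract alludes to. Everything below is an assumption for illustration: the shapes, the ReLU in place of whatever nonlinearity TabMixer uses, and the simple additive injection of the projected tabular vector are not taken from the paper:

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer MLP with a ReLU nonlinearity (a simplification)."""
    return np.maximum(x @ w1, 0.0) @ w2

def tabmixer_block(tokens, tabular, w_tab, w_s1, w_s2, w_c1, w_c2):
    """Mixer-style sketch: project the tabular vector (T,) and add it to
    every imaging token (S, D), then mix along the spatial (token) axis
    and along the channel axis with residual connections."""
    x = tokens + tabular @ w_tab            # (S, D): inject tabular info
    x = x + mlp(x.T, w_s1, w_s2).T          # spatial (token) mixing
    x = x + mlp(x, w_c1, w_c2)              # channel mixing
    return x
```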
Rigid point cloud registration is a fundamental problem that is highly relevant to robotics and autonomous driving. Nowadays, deep learning methods can be trained to match a pair of point clouds, given the transformation between them. However, this training is often not scalable due to the high cost of collecting ground truth poses. Therefore, we present a self-distillation approach to learn point cloud registration in an unsupervised fashion. Here, each sample is passed to a teacher network and an augmented view is passed to a student network. The teacher includes a trainable feature extractor and a learning-free robust solver such as RANSAC. The solver forces consistency among correspondences and optimizes for the unsupervised inlier ratio, eliminating the need for ground truth labels. Our approach simplifies the training procedure by removing the need for initial hand-crafted features or consecutive point cloud frames as seen in related methods. We show that our method not only surpasses them on the RGB-D benchmark 3DMatch but also generalizes well to automotive radar, where classical features adopted by others fail. The code is available at this https URL.
https://arxiv.org/abs/2409.07558
Standard deep learning-based classification approaches may not always be practical in real-world clinical applications, as they require a centralized collection of all samples. Federated learning (FL) provides a paradigm that can learn from distributed datasets across clients without requiring them to share data, which can help mitigate privacy and data ownership issues. In FL, sub-optimal convergence caused by data heterogeneity is common among data from different health centers due to the variety in data collection protocols and patient demographics across centers. Through experimentation in this study, we show that data heterogeneity leads to the phenomenon of catastrophic forgetting during local training. We propose FedImpres, which alleviates catastrophic forgetting by restoring synthetic data that represents the global information as a federated impression. To achieve this, we distill the global model resulting from each communication round. Subsequently, we use the synthetic data alongside the local data to enhance the generalization of local training. Extensive experiments show that the proposed method achieves state-of-the-art performance on both the BloodMNIST and Retina datasets, which contain label imbalance and domain shift, with an improvement in classification accuracy of up to 20%.
https://arxiv.org/abs/2409.07351
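The idea of distilling the global model into synthetic "impression" data can be illustrated with a deliberately tiny stand-in: synthesizing an input that a linear-softmax global model classifies confidently as a given class, by gradient ascent on that class's log-probability. This is only a sketch of the distillation style; the real FedImpres distills a deep model and its exact objective is not reproduced here:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_class_impression(w_global, class_idx, dim, steps=200, lr=0.5):
    """Synthesize an input the global linear model assigns to class_idx
    with high confidence, via gradient ascent on log p(class_idx | x)."""
    x = np.zeros(dim)
    onehot = np.eye(w_global.shape[0])[class_idx]
    for _ in range(steps):
        p = softmax(w_global @ x)
        x += lr * w_global.T @ (onehot - p)   # gradient of log p_c w.r.t. x
    return x
```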
Euclidean deep learning is often inadequate for addressing real-world signals where the representation space is irregular and curved with complex topologies. Interpreting the geometric properties of such feature spaces has become paramount in obtaining robust and compact feature representations that remain unaffected by nontrivial geometric transformations, which vanilla CNNs cannot effectively handle. Recognizing rotation, translation, permutation, or scale symmetries can lead to equivariance properties in the learned representations. This has led to notable advancements in computer vision and machine learning tasks under the framework of geometric deep learning, as compared to their invariant counterparts. In this report, we emphasize the importance of symmetry group equivariant deep learning models and their realization of convolution-like operations on graphs, 3D shapes, and non-Euclidean spaces by leveraging group theory and symmetry. We categorize them as regular, steerable, and PDE-based convolutions and thoroughly examine the inherent symmetries of their input spaces and ensuing representations. We also outline the mathematical link between group convolutions or message aggregation operations and the concept of equivariance. The report also highlights various datasets, their application scopes, limitations, and insightful observations on future directions to serve as a valuable reference and stimulate further research in this emerging discipline.
https://arxiv.org/abs/2409.07327
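The central property surveyed above, equivariance f(g·x) = g·f(x), can be demonstrated minimally with 1-D circular cross-correlation (the simplest group convolution, over the cyclic translation group): shifting the input by s shifts the output by s. This toy example stands in for the graph and 3D cases discussed in the report:

```python
import numpy as np

def circular_correlation(x, w):
    """1-D circular cross-correlation: out[k] = sum_i x[(i + k) % n] * w[i]."""
    n = len(x)
    return np.array([sum(x[(i + k) % n] * w[i] for i in range(n))
                     for k in range(n)])
```

Translation equivariance here means `circular_correlation(np.roll(x, s), w)` equals `np.roll(circular_correlation(x, w), s)` for any shift s.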
Recent advances in deep learning have markedly improved autonomous driving (AD) models, particularly end-to-end systems that integrate perception, prediction, and planning stages, achieving state-of-the-art performance. However, these models remain vulnerable to adversarial attacks, where human-imperceptible perturbations can disrupt decision-making processes. While adversarial training is an effective method for enhancing model robustness against such attacks, no prior studies have focused on its application to end-to-end AD models. In this paper, we take the first step in adversarial training for end-to-end AD models and present a novel Module-wise Adaptive Adversarial Training (MA2T). However, extending conventional adversarial training to this context is highly non-trivial, as different stages within the model have distinct objectives and are strongly interconnected. To address these challenges, MA2T first introduces Module-wise Noise Injection, which injects noise before the input of different modules, targeting training models with the guidance of overall objectives rather than each independent module loss. Additionally, we introduce Dynamic Weight Accumulation Adaptation, which incorporates accumulated weight changes to adaptively learn and adjust the loss weights of each module based on their contributions (accumulated reduction rates) for better balance and robust training. To demonstrate the efficacy of our defense, we conduct extensive experiments on the widely-used nuScenes dataset across several end-to-end AD models under both white-box and black-box attacks, where our method outperforms other baselines by large margins (+5-10%). Moreover, we validate the robustness of our defense through closed-loop evaluation in the CARLA simulation environment, showing improved resilience even against natural corruption.
https://arxiv.org/abs/2409.07321
Automated pavement monitoring using computer vision can analyze pavement conditions more efficiently and accurately than manual methods. Accurate segmentation is essential for quantifying the severity and extent of pavement defects and, consequently, the overall condition index used for prioritizing rehabilitation and maintenance activities. Deep learning-based segmentation models are, however, often supervised and require pixel-level annotations, which can be costly and time-consuming. While recent zero-shot segmentation models can generate pixel-wise labels for unseen classes without any training data, they struggle with the irregularities of cracks and textured pavement backgrounds. This research proposes a zero-shot segmentation model, PaveSAM, that can segment pavement distresses using bounding box prompts. By retraining SAM's mask decoder with just 180 images, pavement distress segmentation is revolutionized, enabling efficient distress segmentation using bounding box prompts, a capability not found in current segmentation models. This not only drastically reduces labeling efforts and costs but also showcases our model's high performance with minimal input, establishing the pioneering use of SAM in pavement distress segmentation. Furthermore, researchers can use existing open-source pavement distress images annotated with bounding boxes to create segmentation masks, which increases the availability and diversity of segmented pavement distress datasets.
https://arxiv.org/abs/2409.07295
Sound Source Localization (SSL) is an enabling technology for applications such as surveillance and robotics. While traditional Signal Processing (SP)-based SSL methods provide analytic solutions under specific signal and noise assumptions, recent Deep Learning (DL)-based methods have significantly outperformed them. However, their success depends on extensive training data and substantial computational resources. Moreover, they often rely on large-scale annotated spatial data and may struggle when adapting to evolving sound classes. To mitigate these challenges, we propose a novel Class Incremental Learning (CIL) approach, termed SSL-CIL, which avoids serious accuracy degradation due to catastrophic forgetting by incrementally updating the DL-based SSL model through a closed-form analytic solution. In particular, data privacy is ensured since the learning process does not revisit any historical data (exemplar-free), which makes it well suited to smart home scenarios. Empirical results on the public SSLR dataset demonstrate the superior performance of our proposal, achieving a localization accuracy of 90.9% and surpassing other competitive methods.
https://arxiv.org/abs/2409.07224
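The "closed-form analytic solution" idea behind exemplar-free incremental updates can be illustrated with recursive ridge regression: keep only the running Gram matrix and cross-correlation, never the raw samples. This is a hypothetical NumPy sketch of that general idea, not the authors' exact SSL-CIL formulation (class names and the `lam` regularizer are our own):

```python
import numpy as np

class AnalyticIncrementalHead:
    """Exemplar-free incremental linear head via closed-form ridge regression.

    Only the running statistics A = lam*I + sum(X^T X) and C = sum(X^T Y)
    are stored, so no historical data is ever revisited.
    """

    def __init__(self, dim: int, lam: float = 1e-3):
        self.A = lam * np.eye(dim)          # regularized Gram matrix
        self.C = np.zeros((dim, 0))          # cross-correlation, grows with classes

    def update(self, X: np.ndarray, Y: np.ndarray) -> np.ndarray:
        # Grow the output dimension when a phase introduces new classes.
        if Y.shape[1] > self.C.shape[1]:
            pad = Y.shape[1] - self.C.shape[1]
            self.C = np.hstack([self.C, np.zeros((self.C.shape[0], pad))])
        self.A += X.T @ X
        self.C += X.T @ Y
        return np.linalg.solve(self.A, self.C)  # current weights W

rng = np.random.default_rng(0)
head = AnalyticIncrementalHead(dim=8)
X1, Y1 = rng.normal(size=(50, 8)), np.eye(2)[rng.integers(0, 2, 50)]
X2, Y2 = rng.normal(size=(50, 8)), np.eye(4)[rng.integers(2, 4, 50)]
head.update(X1, Y1)           # phase 1: classes {0, 1}
W_inc = head.update(X2, Y2)   # phase 2: classes {2, 3} added, no replay
```

Because the update only accumulates sufficient statistics, the incremental weights are exactly what a joint fit over all phases would produce, which is why this style of update sidesteps catastrophic forgetting.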
Diffusion-weighted imaging (DWI) is a Magnetic Resonance Imaging (MRI) technique sensitised to the diffusivity of water molecules; it offers the capability to inspect tissue microstructure and is the only method to reconstruct white matter fiber tracts non-invasively in vivo. The DWI signal can be analysed with the diffusion tensor imaging (DTI) model to estimate the directionality of water diffusion within each voxel. Several scalar metrics, including axial diffusivity (AD), mean diffusivity (MD), radial diffusivity (RD), and fractional anisotropy (FA), can be further derived from DTI to quantitatively summarise the microstructural integrity of brain tissue. These scalar metrics have played an important role in understanding the organisation and health of brain tissue at a microscopic level in clinical studies. However, reliable DTI metrics require DWI acquisitions with a high number of gradient directions, which often goes beyond commonly used clinical protocols. To enhance the utility of clinically acquired DWI and save scanning time for robust DTI analysis, this work proposes DirGeo-DTI, a deep learning-based method to estimate reliable DTI metrics even from a set of DWIs acquired with the minimum theoretical number (6) of gradient directions. DirGeo-DTI leverages directional encoding and geometric constraints to facilitate the training process. Two public DWI datasets were used for evaluation, demonstrating the effectiveness of the proposed method. Extensive experimental results show that the proposed method achieves the best performance compared to existing DTI enhancement methods and can potentially reveal further clinical insights from routine clinical DWI scans.
扩散加权成像(DWI)是一种对水分子扩散率敏感的磁共振成像(MRI)技术,能够观察组织微观结构,并且是唯一能够在体内非侵入性地重建白质纤维束的方法。DWI信号可以用扩散张量成像(DTI)模型进行分析,以估计体素内水分子扩散的方向性。由DTI可以进一步导出若干标量指标,包括轴向扩散率(AD)、平均扩散率(MD)、径向扩散率(RD)和分数各向异性(FA),以定量概括脑组织微观结构的完整性。在临床研究中,这些标量指标对于在微观层面理解脑组织的组织结构与健康状况发挥了重要作用。然而,可靠的DTI指标依赖于具有大量梯度方向的DWI采集,这往往超出常用的临床协议。为了提高临床采集的DWI的实用性并节省扫描时间以实现稳健的DTI分析,本文提出了DirGeo-DTI,一种基于深度学习的方法,即使从仅以最少理论数量(6个)梯度方向采集的一组DWI中也能估计可靠的DTI指标。DirGeo-DTI利用方向编码和几何约束来促进训练过程。我们使用两个公开的DWI数据集进行评估,证明了所提方法的有效性。大量实验结果表明,与现有的DTI增强方法相比,所提方法取得了最佳性能,并有望从常规临床DWI扫描中揭示更多临床洞察。
https://arxiv.org/abs/2409.07186
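The "minimum theoretical number (6)" comes from the symmetric tensor having six unknowns: with exactly six gradient directions the linearised Stejskal-Tanner system is square and solvable directly. A self-contained NumPy sketch of a standard six-direction DTI fit on a synthetic voxel (direction set, b-value, and tensor values are illustrative; this is the classical fit, not DirGeo-DTI):

```python
import numpy as np

# Six non-collinear gradient directions (a common minimal scheme).
dirs = np.array([[1, 1, 0], [1, -1, 0], [1, 0, 1],
                 [1, 0, -1], [0, 1, 1], [0, 1, -1]]) / np.sqrt(2)
b = 1000.0  # b-value in s/mm^2

def design(g: np.ndarray) -> np.ndarray:
    """Rows [gx^2, gy^2, gz^2, 2*gx*gy, 2*gx*gz, 2*gy*gz] per direction."""
    gx, gy, gz = g.T
    return np.stack([gx**2, gy**2, gz**2,
                     2 * gx * gy, 2 * gx * gz, 2 * gy * gz], axis=1)

def fit_tensor(signal: np.ndarray, s0: float) -> tuple[float, float]:
    """Solve log(S/S0) = -b * g^T D g for D, then derive MD and FA."""
    d6 = np.linalg.solve(-b * design(dirs), np.log(signal / s0))
    D = np.array([[d6[0], d6[3], d6[4]],
                  [d6[3], d6[1], d6[5]],
                  [d6[4], d6[5], d6[2]]])
    ev = np.linalg.eigvalsh(D)
    md = ev.mean()
    fa = np.sqrt(1.5 * ((ev - md) ** 2).sum() / (ev ** 2).sum())
    return float(md), float(fa)

# Synthetic voxel with a known axially symmetric tensor
# (lambda1 = 1.7e-3, lambda2 = lambda3 = 0.3e-3 mm^2/s).
D_true = np.diag([1.7e-3, 0.3e-3, 0.3e-3])
s0 = 1.0
signal = s0 * np.exp(-b * np.einsum('ij,jk,ik->i', dirs, D_true, dirs))
md, fa = fit_tensor(signal, s0)
```

With noise-free data this square solve recovers the tensor exactly; the fragility of six-direction fits under real acquisition noise is precisely what motivates learning-based enhancement methods.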
Biometric authentication has garnered significant attention as a secure and efficient method of identity verification. Among the various modalities, hand vein biometrics, including finger vein, palm vein, and dorsal hand vein recognition, offer unique advantages due to their high accuracy, low susceptibility to forgery, and non-intrusiveness. The vein patterns within the hand are highly complex and distinct for each individual, making them an ideal biometric identifier. Additionally, hand vein recognition is contactless, enhancing user convenience and hygiene compared to other modalities such as fingerprint or iris recognition. Furthermore, the veins are internally located, rendering them less susceptible to damage or alteration, thus enhancing the security and reliability of the biometric system. The combination of these factors makes hand vein biometrics a highly effective and secure method for identity verification. This review paper delves into the latest advancements in deep learning techniques applied to finger vein, palm vein, and dorsal hand vein recognition. It encompasses all essential fundamentals of hand vein biometrics, summarizes publicly available datasets, and discusses state-of-the-art metrics used for evaluating the three modalities. Moreover, it provides a comprehensive overview of suggested approaches for finger, palm, dorsal, and multimodal vein techniques, offering insights into the best performance achieved, data augmentation techniques, and effective transfer learning methods, along with associated pretrained deep learning models. Additionally, the review addresses the research challenges faced and outlines future directions and perspectives, encouraging researchers to enhance existing methods and propose innovative techniques.
生物特征认证作为一种安全高效的身份验证方法已引起广泛关注。在各种模态中,手部静脉生物识别(包括手指静脉、手掌静脉和手背静脉识别)凭借其高准确率、不易伪造和非侵入性而具有独特优势。手部的静脉图案高度复杂且因人而异,使其成为理想的生物特征标识。此外,手部静脉识别是非接触式的,与指纹或虹膜识别等其他模态相比,提升了用户的便利性和卫生性。再者,静脉位于体内,不易受到损伤或改变,从而增强了生物识别系统的安全性和可靠性。这些因素的结合使手部静脉生物识别成为一种高效且安全的身份验证方法。本综述深入探讨了应用于手指静脉、手掌静脉和手背静脉识别的深度学习技术的最新进展。它涵盖了手部静脉生物识别的所有基本要素,总结了公开可用的数据集,并讨论了用于评估这三种模态的最新指标。此外,它全面概述了针对手指、手掌、手背和多模态静脉技术的已有方法,提供了关于已取得的最佳性能、数据增强技术和有效迁移学习方法及相关预训练深度学习模型的见解。最后,本综述还讨论了所面临的研究挑战并展望了未来方向,鼓励研究人员改进现有方法并提出创新技术。
https://arxiv.org/abs/2409.07128