We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
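A minimal sketch of the erasure step, assuming it amounts to removing each image feature's component along the hallucinated object's embedding direction (names and shapes are illustrative, not the authors' code):

```python
import torch

def erase_hallucinated_object(image_feats: torch.Tensor, object_dir: torch.Tensor) -> torch.Tensor:
    """Linearly orthogonalize image features w.r.t. a hallucinated-object direction.

    image_feats: (num_image_tokens, d) latent image representations (assumed layout).
    object_dir:  (d,) embedding direction associated with the hallucinated object.
    """
    u = object_dir / object_dir.norm()              # unit vector for the object direction
    coeffs = image_feats @ u                        # (num_image_tokens,) projections onto it
    return image_feats - coeffs.unsqueeze(-1) * u   # keep only the orthogonal component
```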
https://arxiv.org/abs/2410.02762
Unsupervised Domain Adaptation (UDA) is crucial for reducing the need for extensive manual data annotation when training deep networks on point cloud data. A significant challenge of UDA lies in effectively bridging the domain gap. To tackle this challenge, we propose \textbf{C}urvature \textbf{D}iversity-Driven \textbf{N}uclear-Norm Wasserstein \textbf{D}omain Alignment (CDND). Our approach first introduces a \textit{\textbf{Curv}ature Diversity-driven Deformation \textbf{Rec}onstruction (CurvRec)} task, which effectively mitigates the gap between the source and target domains by enabling the model to extract salient features from semantically rich regions of a given point cloud. We then propose \textit{\textbf{D}eformation-based \textbf{N}uclear-norm \textbf{W}asserstein \textbf{D}iscrepancy (D-NWD)}, which applies the Nuclear-norm Wasserstein Discrepancy to both \textit{deformed and original} data samples to align the source and target domains. Furthermore, we contribute a theoretical justification for the effectiveness of D-NWD in distribution alignment and demonstrate that it is \textit{generic} enough to be applied to \textbf{any} deformations. To validate our method, we conduct extensive experiments on two public domain adaptation datasets for point cloud classification and segmentation tasks. Empirical results show that our CDND achieves state-of-the-art performance by a noticeable margin over existing approaches.
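The abstract does not spell out D-NWD; as a loose, hypothetical sketch, one common nuclear-norm-based discrepancy compares the nuclear norms of source and target prediction matrices, applied here to both original and deformed batches (the sign convention and weighting are assumptions):

```python
import torch

def nuclear_norm(pred: torch.Tensor) -> torch.Tensor:
    """Nuclear norm (sum of singular values) of a (batch, num_classes) prediction matrix."""
    return torch.linalg.matrix_norm(pred, ord="nuc")

def d_nwd_loss(src_preds, tgt_preds, src_preds_deformed, tgt_preds_deformed):
    """Illustrative D-NWD-style objective: the nuclear-norm discrepancy is evaluated on
    both original and deformed samples and averaged (not the paper's exact formulation)."""
    original = nuclear_norm(src_preds) - nuclear_norm(tgt_preds)
    deformed = nuclear_norm(src_preds_deformed) - nuclear_norm(tgt_preds_deformed)
    return (original + deformed) / 2
```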
https://arxiv.org/abs/2410.02720
Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. A natural approach, inspired by advances in Large Language Models, is to tokenize control images and prefill the resulting tokens into the autoregressive model before decoding image tokens; however, this still falls short of ControlNet in generation quality and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens. ControlAR then exploits a conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared to prefilling tokens, conditional decoding significantly strengthens the control capability of AR models while maintaining the model's efficiency. Furthermore, the proposed ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of the proposed ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Moreover, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and demo will soon be available at this https URL.
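A toy sketch of the per-token fusion behind conditional decoding, assuming an additive fusion analogous to a positional encoding (module names and the generic decoder are placeholders, not the paper's implementation):

```python
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Toy conditional decoding: fuse control tokens with image token embeddings
    position-by-position, then predict the next image token autoregressively."""
    def __init__(self, decoder: nn.Module, d_model: int, vocab_size: int):
        super().__init__()
        self.decoder = decoder                      # any causal transformer decoder stack
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_tok_emb, control_tok_emb):
        # both: (batch, seq_len, d_model); control token i is aligned with image token i
        fused = image_tok_emb + control_tok_emb     # per-token fusion, like a positional encoding
        hidden = self.decoder(fused)                # causal self-attention over the fused sequence
        return self.lm_head(hidden[:, -1])          # logits for the next image token
```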
https://arxiv.org/abs/2410.02705
The evaluation of segmentation performance is a common task in biomedical image analysis, with its importance emphasized in the recently released metrics selection guidelines and computing frameworks. To quantitatively evaluate the alignment of two segmentations, researchers commonly resort to counting metrics, such as the Dice similarity coefficient, or distance-based metrics, such as the Hausdorff distance, which are usually computed by publicly available open-source tools with an inherent assumption that these tools provide consistent results. In this study we questioned this assumption and performed a systematic implementation analysis, along with quantitative experiments on real-world clinical data, to compare 11 open-source tools for distance-based metrics computation against our highly accurate mesh-based reference implementation. The results revealed statistically significant differences among all open-source tools, which is both surprising and concerning, as it calls the validity of existing studies into question. Besides identifying the main sources of variation, we also provide recommendations for distance-based metrics computation.
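For context, a straightforward voxel-based Hausdorff computation is sketched below (not the paper's mesh-based reference); choices such as how the boundary is extracted and whether voxel spacing is applied are exactly the kind of implementation detail the study shows can shift results:

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import directed_hausdorff

def boundary_points(mask: np.ndarray) -> np.ndarray:
    """Coordinates of boundary voxels of a boolean mask: voxels removed by a one-voxel erosion."""
    boundary = mask & ~binary_erosion(mask)
    return np.argwhere(boundary)

def hausdorff_distance(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two binary segmentations, in voxel units
    (physical spacing deliberately ignored here, one of the common sources of divergence)."""
    pts_a, pts_b = boundary_points(mask_a), boundary_points(mask_b)
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])
```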
https://arxiv.org/abs/2410.02630
Large Language Models (LLMs), known for their versatility in textual data, are increasingly being explored for their potential to enhance medical image segmentation, a crucial task for accurate diagnostic imaging. This study explores enhancing Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks. Our approach, which incorporates a frozen LLM transformer block into the encoder of a ViT-based model, leads to substantial improvements in segmentation performance across various medical imaging modalities. We propose a Hybrid Attention Mechanism that combines global and local feature learning with a Multi-Scale Fusion Block for aggregating features across different scales. The enhanced model shows significant performance gains, including an average Dice score increase from 0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index. These results demonstrate the effectiveness of LLM-based transformers in refining medical image segmentation, highlighting their potential to significantly boost model accuracy and robustness. The source code and our implementation are available at: this https URL
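A rough sketch of inserting a frozen LLM transformer block into a ViT encoder; the bridging projections and residual wiring are assumptions, and `llm_block` stands in for any pretrained block that maps (batch, tokens, llm_dim) to the same shape:

```python
import torch.nn as nn

class FrozenLLMBlockAdapter(nn.Module):
    """Pass ViT tokens through a frozen, text-pretrained transformer block."""
    def __init__(self, llm_block: nn.Module, vit_dim: int, llm_dim: int):
        super().__init__()
        self.llm_block = llm_block
        for p in self.llm_block.parameters():
            p.requires_grad = False                  # the LLM block stays frozen
        self.proj_in = nn.Linear(vit_dim, llm_dim)   # bridge ViT width -> LLM width
        self.proj_out = nn.Linear(llm_dim, vit_dim)  # and back

    def forward(self, tokens):                       # tokens: (batch, n_tokens, vit_dim)
        x = self.proj_in(tokens)
        x = self.llm_block(x)                        # frozen text-pretrained computation
        return tokens + self.proj_out(x)             # residual back into the ViT stream
```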
https://arxiv.org/abs/2410.02458
Numerous studies have demonstrated the strong performance of Vision Transformer (ViT)-based methods across various computer vision tasks. However, ViT models often struggle to effectively capture high-frequency components in images, which are crucial for detecting small targets and preserving edge details, especially in complex scenarios. This limitation is particularly challenging in colon polyp segmentation, where polyps exhibit significant variability in structure, texture, and shape. High-frequency information, such as boundary details, is essential for achieving precise semantic segmentation in this context. To address these challenges, we propose HiFiSeg, a novel network for colon polyp segmentation that enhances high-frequency information processing through a global-local vision transformer framework. HiFiSeg leverages the pyramid vision transformer (PVT) as its encoder and introduces two key modules: the global-local interaction module (GLIM) and the selective aggregation module (SAM). GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine-grained features. SAM selectively integrates boundary details from low-level features with semantic information from high-level features, significantly improving the model's ability to accurately detect and segment polyps. Extensive experiments on five widely recognized benchmark datasets demonstrate the effectiveness of HiFiSeg for polyp segmentation. Notably, the mDice scores on the challenging CVC-ColonDB and ETIS datasets reached 0.826 and 0.822, respectively, underscoring the superior performance of HiFiSeg in handling the specific complexities of this task.
https://arxiv.org/abs/2410.02528
Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning. Although models based on convolutional neural networks (CNNs) and Transformers have achieved remarkable success in medical image segmentation tasks, they still face challenges such as high computational complexity and the loss of local features when capturing long-range dependencies. To address these limitations, we propose Med-TTT, a visual backbone network integrated with Test-Time Training (TTT) layers, which incorporates dynamic adjustment capabilities. Med-TTT introduces the Vision-TTT layer, which enables effective modeling of long-range dependencies with linear computational complexity and adaptive parameter adjustment during inference. Furthermore, we design a multi-resolution fusion mechanism to combine image features at different scales, facilitating the identification of subtle lesion characteristics in complex backgrounds. At the same time, we adopt a frequency-domain feature enhancement strategy based on high-pass filtering, which better captures texture and fine-grained details in images. Experimental results demonstrate that Med-TTT significantly outperforms existing methods on multiple medical image datasets, exhibiting strong segmentation capabilities, particularly in complex image backgrounds. The model achieves leading performance in terms of accuracy, sensitivity, and Dice coefficient, providing an efficient and robust solution for the field of medical image segmentation. The code is available at this https URL.
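A minimal sketch of high-pass frequency-domain enhancement (the cutoff and blending weight are illustrative choices, not the paper's settings):

```python
import torch

def high_pass_enhance(x: torch.Tensor, cutoff: int = 8, weight: float = 0.5) -> torch.Tensor:
    """Suppress a (2*cutoff)x(2*cutoff) low-frequency block in the centered spectrum
    and blend the high-frequency residual back into the input.
    x: (batch, channels, H, W) image or feature map."""
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    cy, cx = h // 2, w // 2
    freq[..., cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff] = 0    # zero out low frequencies
    high = torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real
    return x + weight * high                                           # emphasize texture and edges
```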
https://arxiv.org/abs/2410.02523
Clinnova, a collaborative initiative involving France, Germany, Switzerland, and Luxembourg, is dedicated to unlocking the power of precision medicine through data federation, standardization, and interoperability. This European Greater Region initiative seeks to create an interoperable European standard using artificial intelligence (AI) and data science to enhance healthcare outcomes and efficiency. Key components include multidisciplinary research centers, a federated biobanking strategy, a digital health innovation platform, and a federated AI strategy. It targets inflammatory bowel disease, rheumatoid diseases, and multiple sclerosis (MS), emphasizing data quality to develop AI algorithms for personalized treatment and translational research. The IHU Strasbourg (Institute of Minimal-invasive Surgery) has the lead in this initiative to develop the federated learning (FL) proof of concept (POC) that will serve as a foundation for advancing AI in healthcare. At its core, Clinnova-MS aims to enhance MS patient care by using FL to develop more accurate models that detect disease progression, guide interventions, and validate digital biomarkers across multiple sites. This technical report presents insights and key takeaways from the first cross-border federated POC on MS segmentation of MRI images within the Clinnova framework. While our work marks a significant milestone in advancing MS segmentation through cross-border collaboration, it also underscores the importance of addressing technical, logistical, and ethical considerations to realize the full potential of FL in healthcare settings.
https://arxiv.org/abs/2410.02443
Contrastive learning has become a dominant approach in self-supervised visual representation learning, with hard negatives (samples that closely resemble the anchor) being key to enhancing the discriminative power of learned representations. However, efficiently leveraging hard negatives remains a challenge due to the difficulty in identifying and incorporating them without significantly increasing computational costs. To address this, we introduce SynCo (Synthetic Negatives in Contrastive learning), a novel contrastive learning approach that improves model performance by generating synthetic hard negatives. Built on the MoCo framework, SynCo introduces six novel strategies for creating diverse synthetic hard negatives that can be generated on the fly with minimal computational overhead. SynCo achieves faster training and better representation learning, reaching a top-1 accuracy of 68.1% in ImageNet linear evaluation after only 200 pretraining epochs, surpassing MoCo's 67.5% with the same ResNet-50 encoder. Additionally, it transfers more effectively to detection tasks: on PASCAL VOC, it outperforms both the supervised baseline and MoCo, achieving an AP of 82.5%; on the COCO dataset, it sets a new benchmark with 40.4% AP for bounding box detection and 35.4% AP for instance segmentation. Our synthetic hard negative generation procedure significantly enhances the quality of visual representations learned through self-supervised contrastive learning. Code is available at this https URL.
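The abstract does not detail the six strategies; purely as an illustration, one cheap way to synthesize hard negatives on the fly is to mix the queue negatives most similar to the query:

```python
import torch
import torch.nn.functional as F

def synthesize_hard_negatives(query, queue, num_hard=64, num_synth=32, alpha=0.5):
    """Illustrative (not the paper's exact recipe): pick the queue negatives most similar
    to the query and mix random pairs of them into synthetic hard negatives.
    query: (d,), queue: (K, d); all features assumed L2-normalized."""
    sims = queue @ query                                  # (K,) cosine similarities
    hard_idx = sims.topk(num_hard).indices
    hard = queue[hard_idx]                                # (num_hard, d) hardest negatives
    i = torch.randint(0, num_hard, (num_synth,))
    j = torch.randint(0, num_hard, (num_synth,))
    synth = alpha * hard[i] + (1 - alpha) * hard[j]       # convex mixing of hard negatives
    return F.normalize(synth, dim=-1)                     # project back onto the unit sphere
```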
https://arxiv.org/abs/2410.02401
The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few-shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion-based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re-evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework and effectively utilizing the pre-training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.
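A minimal sketch of KV fusion within self-attention, where the support image's keys and values are appended so query-image tokens can attend to them (shapes and naming are assumptions):

```python
import math
import torch

def kv_fusion_attention(q, k_query, v_query, k_support, v_support):
    """q, k_*, v_*: (batch, heads, n_tokens, d_head). Query-image tokens attend over
    their own keys/values concatenated with the support image's keys/values."""
    k = torch.cat([k_query, k_support], dim=2)
    v = torch.cat([v_query, v_support], dim=2)
    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
    return attn @ v
```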
https://arxiv.org/abs/2410.02369
Mamba, a special case of the State Space Model, is gaining popularity as an alternative to template-based deep learning approaches in medical image analysis. While transformers are powerful architectures, they have drawbacks, including quadratic computational complexity and an inability to address long-range dependencies efficiently. This limitation affects the analysis of large and complex datasets in medical imaging, where there are many spatial and temporal relationships. In contrast, Mamba offers benefits that make it well-suited for medical image analysis. It has linear time complexity, which is a significant improvement over transformers. Mamba processes longer sequences without attention mechanisms, enabling faster inference and requiring less memory. Mamba also demonstrates strong performance in merging multimodal data, improving diagnosis accuracy and patient outcomes. The organization of this paper allows readers to appreciate the capabilities of Mamba in medical imaging step by step. We begin by defining core concepts of SSMs and models, including S4, S5, and S6, followed by an exploration of Mamba architectures such as pure Mamba, U-Net variants, and hybrid models with convolutional neural networks, transformers, and Graph Neural Networks. We also cover Mamba optimizations, techniques and adaptations, scanning, datasets, applications, experimental results, and conclude with its challenges and future directions in medical imaging. This review aims to demonstrate the transformative potential of Mamba in overcoming existing barriers within medical imaging while paving the way for innovative advancements in the field. A comprehensive list of Mamba architectures applied in the medical field, reviewed in this work, is available at Github.
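For readers new to SSMs, a bare-bones discretized state-space recurrence is sketched below; this is the generic linear-time scan that S4/S6-style layers build on, not Mamba's selective, hardware-aware implementation:

```python
import torch

def ssm_scan(x, A, B, C):
    """Sequential form of h_t = A h_{t-1} + B x_t,  y_t = C h_t  (single input channel).
    x: (seq_len,), A: (state, state), B: (state,), C: (state,)."""
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:                       # one step per token: linear in sequence length
        h = A @ h + B * x_t
        ys.append(C @ h)
    return torch.stack(ys)
```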
https://arxiv.org/abs/2410.02362
3D instance segmentation is crucial for obtaining an understanding of a point cloud scene. This paper presents a novel neural network architecture for performing instance segmentation on 3D point clouds. We propose to jointly learn coefficients and prototypes in parallel, which can be combined to obtain the instance predictions. The coefficients are computed using an overcomplete set of sampled points with a novel multi-scale module, dubbed dilated point inception. As the set of obtained instance mask predictions is overcomplete, we employ a non-maximum suppression algorithm to retrieve the final predictions. This approach makes it possible to omit the time-expensive clustering step and leads to a more stable inference time. The proposed method is not only 28% faster than the state of the art, it also exhibits the lowest standard deviation. Our experiments show that the standard deviation of the inference time is only 1.0% of the total time, while it ranges between 10.8% and 53.1% for state-of-the-art methods. Lastly, our method outperforms the state of the art both on S3DIS-blocks (by 4.9% in mRec on Fold-5) and PartNet (by 2.0% on average in mAP).
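A small sketch of the coefficient-times-prototype assembly followed by mask NMS, in the spirit of prototype-based instance segmentation (heads, thresholds, and scoring are assumptions):

```python
import torch

def assemble_masks(prototypes, coefficients):
    """prototypes: (num_points, num_protos); coefficients: (num_candidates, num_protos).
    Each candidate instance mask is a linear combination of shared prototypes."""
    return torch.sigmoid(coefficients @ prototypes.T)        # (num_candidates, num_points)

def mask_nms(masks, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression on soft masks using mask IoU."""
    binary = masks > 0.5
    order = scores.argsort(descending=True)
    keep = []
    for idx in order.tolist():
        cand = binary[idx]
        duplicate = False
        for kept in keep:
            inter = (cand & binary[kept]).sum().float()
            union = (cand | binary[kept]).sum().float().clamp(min=1)
            if inter / union > iou_thresh:
                duplicate = True
                break
        if not duplicate:
            keep.append(idx)
    return keep
```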
https://arxiv.org/abs/2410.02352
3D scene understanding is crucial for facilitating seamless interaction between digital devices and the physical world. Real-time capturing and processing of the 3D scene are essential for achieving this seamless integration. While existing approaches typically separate acquisition and processing for each frame, the advent of resolution-scalable 3D sensors offers an opportunity to overcome this paradigm and fully leverage the otherwise wasted acquisition time to initiate processing. In this study, we introduce VX-S3DIS, a novel point cloud dataset accurately simulating the behavior of a resolution-scalable 3D sensor. Additionally, we present RESSCAL3D++, an important improvement over our prior work, RESSCAL3D, by incorporating an update module and processing strategy. By applying our method to the new dataset, we practically demonstrate the potential of joint acquisition and semantic segmentation of 3D point clouds. Our resolution-scalable approach significantly reduces scalability costs from 2% to just 0.2% in mIoU while achieving impressive speed-ups of 15.6 to 63.9% compared to the non-scalable baseline. Furthermore, our scalable approach enables early predictions, with the first one occurring after only 7% of the total inference time of the baseline. The new VX-S3DIS dataset is available at this https URL.
https://arxiv.org/abs/2410.02323
Medical image analysis tasks often focus on regions or structures located in a particular part of the patient's body, and large portions of the image may not be of interest for the analysis task. When using deep-learning-based approaches, this unnecessarily increases the computational burden during inference and raises the chance of errors. In this paper, we introduce CTARR, a novel generic method for CT Anatomical Region Recognition. The method serves as a pre-processing step for any deep-learning-based CT image analysis pipeline by automatically identifying the pre-defined anatomical region that is relevant for the follow-up task and removing the rest. It can be used in (i) image segmentation, to prevent false positives in anatomically implausible regions and speed up inference, (ii) image classification, to produce image crops that are consistent in their anatomical context, and (iii) image registration, by serving as a fast pre-registration step. Our proposed method is based on atlas registration and provides a fast and robust way to crop any anatomical region, encoded as one or multiple bounding box(es), from any unlabeled CT scan of the brain, chest, abdomen and/or pelvis. We demonstrate the utility and robustness of the proposed method in the context of medical image segmentation by evaluating it on six datasets from public segmentation challenges. The foreground voxels in the regions of interest are preserved in the vast majority of cases and tasks (97.45-100%), while the cropping takes only a fraction of a second to compute (0.1-0.21 s) on a deep learning workstation and greatly reduces the segmentation runtime (2.0-12.7x). Our code is available at this https URL.
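A simplified sketch of the cropping step: a bounding box defined in atlas space is mapped into the scan through a registration transform (assumed affine here) and used to crop; the registration itself is not shown:

```python
import numpy as np

def crop_to_atlas_box(volume: np.ndarray, box_atlas: np.ndarray,
                      affine_atlas_to_image: np.ndarray) -> np.ndarray:
    """box_atlas: (2, 3) min/max corners in atlas voxel coordinates.
    affine_atlas_to_image: (4, 4) transform from atlas to image voxel space, assumed to
    come from a fast atlas registration. Returns the cropped sub-volume."""
    corners = np.array([[x, y, z] for x in box_atlas[:, 0]
                                  for y in box_atlas[:, 1]
                                  for z in box_atlas[:, 2]])           # 8 box corners
    corners_h = np.hstack([corners, np.ones((8, 1))])                  # homogeneous coordinates
    mapped = (affine_atlas_to_image @ corners_h.T).T[:, :3]
    lo = np.maximum(np.floor(mapped.min(axis=0)).astype(int), 0)
    hi = np.minimum(np.ceil(mapped.max(axis=0)).astype(int), np.array(volume.shape))
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
```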
https://arxiv.org/abs/2410.02316
Historical maps are invaluable for analyzing long-term changes in transportation and spatial development, offering a rich source of data for evolutionary studies. However, digitizing and classifying road networks from these maps is often expensive and time-consuming, limiting their widespread use. Recent advancements in deep learning have made automatic road extraction from historical maps feasible, yet these methods typically require large amounts of labeled training data. To address this challenge, we introduce a novel framework that integrates deep learning with geoinformation, computer-based painting, and image processing methodologies. This framework enables the extraction and classification of roads from historical maps using only road geometries without needing road class labels for training. The process begins with training of a binary segmentation model to extract road geometries, followed by morphological operations, skeletonization, vectorization, and filtering algorithms. Synthetic training data is then generated by a painting function that artificially re-paints road segments using predefined symbology for road classes. Using this synthetic data, a deep ensemble is trained to generate pixel-wise probabilities for road classes to mitigate distribution shift. These predictions are then discretized along the extracted road geometries. Subsequently, further processing is employed to classify entire roads, enabling the identification of potential changes in road classes and resulting in a labeled road class dataset. Our method achieved completeness and correctness scores of over 94% and 92%, respectively, for road class 2, the most prevalent class in the two Siegfried Map sheets from Switzerland used for testing. This research offers a powerful tool for urban planning and transportation decision-making by efficiently extracting and classifying roads from historical maps.
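For the post-processing chain described above, a minimal scikit-image sketch of the morphology-and-skeletonization portion (thresholds and structuring elements are illustrative):

```python
import numpy as np
from skimage.morphology import binary_closing, remove_small_objects, skeletonize, disk

def road_centerlines(road_probability: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binary road mask from a segmentation probability map, cleaned with morphological
    operations and reduced to one-pixel-wide centerlines ready for vectorization."""
    mask = road_probability > threshold
    mask = binary_closing(mask, disk(3))             # bridge small gaps in the road mask
    mask = remove_small_objects(mask, min_size=64)   # drop isolated noise blobs
    return skeletonize(mask)                         # one-pixel-wide skeleton
```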
https://arxiv.org/abs/2410.02250
Recently, the integration of the local modeling capabilities of Convolutional Neural Networks (CNNs) with the global dependency strengths of Transformers has created a sensation in the semantic segmentation community. However, substantial computational workloads and high hardware memory demands remain major obstacles to their further application in real-time scenarios. In this work, we propose a lightweight multiple-information interaction network for real-time semantic segmentation, called LMIINet, which effectively combines CNNs and Transformers while reducing redundant computations and memory footprint. It features Lightweight Feature Interaction Bottleneck (LFIB) modules comprising efficient convolutions that enhance context integration. Additionally, improvements are made to the Flatten Transformer by enhancing local and global feature interaction to capture detailed semantic information. The incorporation of a combination coefficient learning scheme in both LFIB and Transformer blocks facilitates improved feature interaction. Extensive experiments demonstrate that LMIINet excels in balancing accuracy and efficiency. With only 0.72M parameters and 11.74G FLOPs, LMIINet achieves 72.0% mIoU at 100 FPS on the Cityscapes test set and 69.94% mIoU at 160 FPS on the CamVid test dataset using a single RTX2080Ti GPU.
https://arxiv.org/abs/2410.02224
Melanoma segmentation in Whole Slide Images (WSIs) is useful for prognosis and the measurement of crucial prognostic factors such as Breslow depth and primary invasive tumor size. In this paper, we present a novel approach that uses the Segment Anything Model (SAM) for automatic melanoma segmentation in microscopy slide images. Our method employs an initial semantic segmentation model to generate preliminary segmentation masks that are then used to prompt SAM. We design a dynamic prompting strategy that uses a combination of centroid and grid prompts to achieve optimal coverage of the super high-resolution slide images while maintaining the quality of generated prompts. To optimize for invasive melanoma segmentation, we further refine the prompt generation process by implementing in-situ melanoma detection and low-confidence region filtering. We select Segformer as the initial segmentation model and EfficientSAM as the segment anything model for parameter-efficient fine-tuning. Our experimental results demonstrate that this approach not only surpasses other state-of-the-art melanoma segmentation methods but also significantly outperforms the baseline Segformer by 9.1% in terms of IoU.
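A rough sketch of deriving centroid and grid point prompts from a preliminary mask (component centroids via SciPy; the grid spacing and filtering are illustrative, not the paper's exact strategy):

```python
import numpy as np
from scipy import ndimage

def centroid_and_grid_prompts(prelim_mask: np.ndarray, grid_step: int = 256) -> np.ndarray:
    """prelim_mask: (H, W) boolean mask from the initial segmentation model.
    Returns an (N, 2) array of (row, col) point prompts for SAM-style prompting."""
    labels, num = ndimage.label(prelim_mask)
    centroids = ndimage.center_of_mass(prelim_mask, labels, range(1, num + 1))  # one per component
    prompts = [np.round(c) for c in centroids]
    # regular grid points restricted to the foreground, to cover very large regions
    for r in range(0, prelim_mask.shape[0], grid_step):
        for c in range(0, prelim_mask.shape[1], grid_step):
            if prelim_mask[r, c]:
                prompts.append(np.array([r, c]))
    return np.array(prompts, dtype=int)
```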
https://arxiv.org/abs/2410.02207
Convolutional neural networks (CNNs) have shown great effectiveness in medical image segmentation. However, they may be limited in modeling large inter-subject variations in organ shapes and sizes and in exploiting global long-range contextual information. This is because CNNs typically employ convolutions with fixed-sized local receptive fields and lack mechanisms to utilize global information. To address these limitations, we developed Dynamic Multi-Resolution Convolution (DMRC) and Dynamic Multi-Scale Convolution (DMSC) modules. Both modules enhance the representation capabilities of single convolutions to capture varying scaled features and global contextual information. The DMRC module achieves this by applying a convolutional filter to images at different resolutions and subsequently utilizing dynamic mechanisms to model global inter-dependencies between features. In contrast, the DMSC module extracts features at different scales by employing convolutions with different kernel sizes and utilizing dynamic mechanisms to extract global contextual information. Using convolutions with different kernel sizes in the DMSC module may increase computational complexity; to lessen this burden, we adopt a lightweight design for the convolution layers with large kernel sizes. Thus, the DMSC and DMRC modules are designed as lightweight drop-in replacements for single convolutions and can be easily integrated into general CNN architectures for end-to-end training. We propose a segmentation network, termed the Dynamic Multi-scale and Multi-resolution Convolution network (DMC-Net), by incorporating our DMSC and DMRC modules into a standard U-Net architecture. The results demonstrate that our proposed DMSC and DMRC modules can enhance the representation capabilities of single convolutions and improve segmentation accuracy.
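A compact, hypothetical sketch of a multi-scale convolution with dynamic (squeeze-and-excitation-style) recalibration, in the spirit of DMSC; the paper's actual module design may differ:

```python
import torch
import torch.nn as nn

class MultiScaleDynamicConv(nn.Module):
    """Parallel convolutions with different kernel sizes, fused and recalibrated by
    channel weights computed from global context (a DMSC-like sketch, not the paper's code)."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global context per channel
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        fused = sum(branch(x) for branch in self.branches)
        return fused * self.gate(fused)              # dynamic channel-wise recalibration
```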
https://arxiv.org/abs/2410.02129
In the domain of battery research, the processing of high-resolution microscopy images is a challenging task, as it involves dealing with complex images and requires a prior understanding of the components involved. The utilization of deep learning methodologies for image analysis has attracted considerable interest in recent years, with multiple investigations employing such techniques for image segmentation and analysis within the realm of battery research. However, the automated analysis of high-resolution microscopy images for detecting phases and components in composite materials is still an underexplored area. This work proposes a novel workflow for detecting components and phase segmentation from raw high resolution transmission electron microscopy (TEM) images using a trained U-Net segmentation model. The developed model can expedite the detection of components and phase segmentation, diminishing the temporal and cognitive demands associated with scrutinizing an extensive array of TEM images, thereby mitigating the potential for human errors. This approach presents a novel and efficient image analysis approach with broad applicability beyond the battery field and holds potential for application in other related domains characterized by phase and composition distribution, such as alloy production.
https://arxiv.org/abs/2410.01928
Remote sensing imagery plays an irreplaceable role in fields such as agriculture, water resources, the military, and disaster relief. Pixel-level interpretation is a critical aspect of remote sensing image applications; however, a prevalent limitation remains the need for extensive manual annotation. To address this, we introduce open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, because remote sensing images are sensitive to low-resolution features, the predicted masks exhibit distorted target shapes and ill-fitting boundaries. To tackle this issue, we propose a simple and general upsampler, SimFeatUp, which restores the spatial information lost in deep features in a training-free manner. Further, based on the observation that local patch tokens respond abnormally to the [CLS] token in CLIP, we propose a straightforward subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning semantic segmentation, building extraction, road detection, and flood detection tasks. Our method achieves an average improvement of 5.8%, 8.2%, 4%, and 15.3% over state-of-the-art methods on the 4 tasks. All code is released. \url{this https URL}
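A minimal sketch of the subtraction idea: remove a scaled global component from every patch token before computing similarities with text embeddings (using the mean patch token as a stand-in for the [CLS]-aligned bias; the weight and choice of global vector are assumptions):

```python
import torch

def remove_global_bias(patch_tokens: torch.Tensor, weight: float = 0.3) -> torch.Tensor:
    """patch_tokens: (num_patches, d) CLIP patch features. Subtract a shared global
    component so per-patch responses are less dominated by image-level semantics."""
    global_component = patch_tokens.mean(dim=0, keepdim=True)   # (1, d) stand-in for the global bias
    return patch_tokens - weight * global_component
```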
https://arxiv.org/abs/2410.01768