We study the potential of noisy labels y to pretrain semantic segmentation models in a multi-modal learning framework for geospatial applications. Specifically, we propose a novel Cross-modal Sample Selection method (CromSS) that utilizes the class distributions P^{(d)}(x,c) over pixels x and classes c modelled by multiple sensors/modalities d of a given geospatial scene. Consistency of predictions across sensors d is jointly informed by the entropy of P^{(d)}(x,c). We determine noisy label sampling by the confidence of each sensor d in the noisy class label, P^{(d)}(x,c=y(x)). To verify the performance of our approach, we conduct experiments with Sentinel-1 (radar) and Sentinel-2 (optical) satellite imagery from the globally-sampled SSL4EO-S12 dataset. We pair those scenes with 9-class noisy labels sourced from the Google Dynamic World project for pretraining. Transfer learning evaluations (downstream task) on the DFC2020 dataset confirm the effectiveness of the proposed method for remote sensing image segmentation.
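As a concrete illustration of the selection rule, here is a minimal NumPy sketch (not the authors' code) of how the per-sensor confidence P^{(d)}(x, c=y(x)) and the entropy of P^{(d)}(x, c) could jointly gate which noisy-labeled pixels enter the pretraining loss; the thresholds are illustrative assumptions.

```python
import numpy as np

def select_pixels(probs, noisy_labels, conf_thresh=0.7):
    """Hedged sketch of cross-modal noisy-label sampling.

    probs: dict mapping sensor/modality d -> softmax map of shape (H, W, C),
           i.e. P^(d)(x, c) for every pixel x and class c.
    noisy_labels: (H, W) integer map y(x) from the noisy label source.
    Returns a boolean (H, W) mask of pixels kept for the pretraining loss.
    """
    H, W = noisy_labels.shape
    keep = np.ones((H, W), dtype=bool)
    for d, p in probs.items():
        # Confidence of sensor d in the noisy label: P^(d)(x, c = y(x)).
        conf = np.take_along_axis(p, noisy_labels[..., None], axis=-1)[..., 0]
        # Entropy of P^(d)(x, .) as a per-pixel reliability signal.
        entropy = -np.sum(p * np.log(p + 1e-8), axis=-1)
        keep &= (conf > conf_thresh) & (entropy < 0.5 * np.log(p.shape[-1]))
    return keep
```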
https://arxiv.org/abs/2405.01217
Self-training is a powerful approach to deep learning. The key step is to find a pseudo-label for modeling. However, previous self-training algorithms suffer from the over-confidence issue introduced by hard labels, an uncertainty that even confidence-related regularizers cannot comprehensively capture. Therefore, we propose a new self-training framework that combines uncertainty information from both the model and the dataset. Specifically, we propose to use Expectation-Maximization (EM) to smooth the labels and comprehensively estimate the uncertainty information. We further design a basis extraction network to estimate the initial basis from the dataset. The obtained basis with uncertainty can be filtered based on uncertainty information and then transformed into real hard labels to iteratively update the model and basis in the retraining process. Experiments on image classification and semantic segmentation show the advantages of our method among confidence-aware self-training algorithms, with improvements of 1-3 percentage points on different datasets.
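The EM smoothing and the basis-extraction network are the paper's contribution and are not reproduced here; the sketch below only illustrates the final step the abstract describes, assuming smoothed soft labels are given: filter by predictive uncertainty, then convert the survivors into hard labels for retraining. The entropy threshold is an illustrative assumption.

```python
import torch

def harden_soft_labels(soft_labels, entropy_thresh=0.4):
    """Hedged sketch: turn smoothed soft labels into hard pseudo-labels,
    discarding samples whose normalized entropy signals high uncertainty.

    soft_labels: (N, C) tensor of smoothed class distributions.
    Returns hard labels with -1 (the usual ignore_index) for filtered samples.
    """
    probs = soft_labels.clamp_min(1e-8)
    num_classes = probs.shape[1]
    entropy = -(probs * probs.log()).sum(dim=1) / torch.log(
        torch.tensor(float(num_classes)))       # normalized to [0, 1]
    hard = probs.argmax(dim=1)
    hard[entropy >= entropy_thresh] = -1        # ignored during retraining
    return hard
```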
https://arxiv.org/abs/2405.01175
This paper investigates the effectiveness of self-supervised pre-trained transformers compared to supervised pre-trained transformers and conventional neural networks (ConvNets) for detecting various types of deepfakes. We focus on their potential for improved generalization, particularly when training data is limited. Despite the notable success of large vision-language models built on transformer architectures in various tasks, including zero-shot and few-shot learning, the deepfake detection community has shown some reluctance to adopt pre-trained vision transformers (ViTs), especially large ones, as feature extractors. One concern is their perceived excessive capacity, which often demands extensive data, and the resulting suboptimal generalization when training or fine-tuning data is small or less diverse. This contrasts with ConvNets, which have already established themselves as robust feature extractors. Additionally, training and optimizing transformers from scratch requires significant computational resources, making this accessible primarily to large companies and hindering broader investigation within the academic community. Recent advancements in using self-supervised learning (SSL) in transformers, such as DINO and its derivatives, have showcased significant adaptability across diverse vision tasks and possess explicit semantic segmentation capabilities. By leveraging DINO for deepfake detection with modest training data and implementing partial fine-tuning, we observe comparable adaptability to the task and natural explainability of the detection results via the attention mechanism. Moreover, partial fine-tuning of transformers for deepfake detection offers a more resource-efficient alternative, requiring significantly fewer computational resources.
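Partial fine-tuning of a self-supervised ViT is straightforward to set up; a minimal PyTorch sketch follows, loading the public DINO ViT-B/16 weights via torch.hub. The number of unfrozen blocks and the binary head are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

# Load a self-supervised DINO ViT-B/16 backbone (public torch.hub entry).
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16')

# Freeze everything, then unfreeze only the last two transformer blocks.
for p in backbone.parameters():
    p.requires_grad = False
for blk in backbone.blocks[-2:]:
    for p in blk.parameters():
        p.requires_grad = True

head = nn.Linear(backbone.embed_dim, 2)    # real vs. fake logits

def classify(images):
    return head(backbone(images))          # backbone returns CLS embeddings

optimizer = torch.optim.AdamW(
    [p for p in backbone.parameters() if p.requires_grad]
    + list(head.parameters()), lr=1e-4)
```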
https://arxiv.org/abs/2405.00355
Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity, due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend the slot attention to a multi-query approach, allowing the model to learn multiple sets of slots, producing more stable masks. During training, these multiple sets of slots are learned independently while, at test time, these sets are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: this https URL
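The test-time merging step maps directly onto SciPy's Hungarian solver; below is a hedged sketch, where the IoU cost and the averaging of matched masks are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def merge_slot_sets(masks_a, masks_b):
    """Merge two independently learned slot sets via Hungarian matching.

    masks_a, masks_b: (K, H, W) soft object masks from two slot sets.
    Returns (K, H, W) merged masks.
    """
    K = masks_a.shape[0]
    a, b = masks_a.reshape(K, -1), masks_b.reshape(K, -1)
    inter = a @ b.T
    union = a.sum(1, keepdims=True) + b.sum(1) - inter
    iou = inter / (union + 1e-8)
    row, col = linear_sum_assignment(-iou)   # maximize total pairwise IoU
    return (masks_a[row] + masks_b[col]) / 2.0
```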
https://arxiv.org/abs/2404.19654
Document semantic segmentation is a promising avenue that can facilitate document analysis tasks, including optical character recognition (OCR), form classification, and document editing. Although several synthetic datasets have been developed to distinguish handwriting from printed text, they fall short in class variety and document diversity. We demonstrate the limitations of training on existing datasets when solving the National Archives Form Semantic Segmentation dataset (NAFSS), a dataset we introduce. To address these limitations, we propose the most comprehensive document semantic segmentation synthesis pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources to create the Document Element Layer INtegration Ensemble 8K, or DELINE8K, dataset. Our customized dataset exhibits superior performance on the NAFSS benchmark, demonstrating its promise as a tool for further research. The DELINE8K dataset is available at this https URL.
https://arxiv.org/abs/2404.19259
Due to the limitations of current optical and sensor technologies and the high cost of updating them, the spectral and spatial resolution of satellites may not always meet desired requirements. For these reasons, Remote-Sensing Single-Image Super-Resolution (RS-SISR) techniques have gained significant interest. In this paper, we propose the Swin2-MoSE model, an enhanced version of Swin2SR. Our model introduces MoE-SM, an enhanced Mixture-of-Experts (MoE) that replaces the feed-forward layer inside every Transformer block. MoE-SM is designed with Smart-Merger, a new layer for merging the outputs of individual experts, and with a new way to split the work among experts, defining a per-example strategy instead of the commonly used per-token one. Furthermore, we analyze how positional encodings interact with each other, demonstrating that per-channel bias and per-head bias can positively cooperate. Finally, we propose to use a combination of Normalized-Cross-Correlation (NCC) and Structural Similarity Index Measure (SSIM) losses to avoid typical MSE loss limitations. Experimental results demonstrate that Swin2-MoSE outperforms the state of the art by 0.377-0.958 dB (PSNR) on 2x, 3x, and 4x resolution-upscaling tasks (Sen2Venus and OLI2MSI datasets). We also show the efficacy of Swin2-MoSE by applying it to a semantic segmentation task (SeasoNet dataset). Code and pretrained models are available at this https URL
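The per-example routing idea can be sketched in a few lines of PyTorch; the gating and expert layout below are illustrative assumptions (and the Smart-Merger layer is omitted), not the MoE-SM definition.

```python
import torch
import torch.nn as nn

class PerExampleMoE(nn.Module):
    """Hedged sketch: expert weights are computed once per example from
    pooled token features, instead of per token as in standard MoE."""

    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                          nn.Linear(hidden, dim))
            for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                  # x: (B, N, dim) token sequence
        weights = self.gate(x.mean(dim=1)).softmax(dim=-1)        # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, N, dim)
        return (weights[:, :, None, None] * outs).sum(dim=1)
```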
https://arxiv.org/abs/2404.18924
The scarcity of labeled data in real-world scenarios is a critical bottleneck for deep learning's effectiveness. Semi-supervised semantic segmentation has been a typical solution to achieve a desirable tradeoff between annotation cost and segmentation performance. However, previous approaches, whether based on consistency regularization or self-training, tend to neglect the contextual knowledge embedded within inter-pixel relations, leading to suboptimal performance and limited generalization. In this paper, we propose IPixMatch, a novel approach designed to mine this neglected but valuable inter-pixel information for semi-supervised learning. Specifically, IPixMatch is constructed as an extension of the standard teacher-student network, incorporating additional loss terms to capture inter-pixel relations. It shines in low-data regimes by efficiently leveraging the limited labeled data and extracting maximum utility from the available unlabeled data. Furthermore, IPixMatch can be integrated seamlessly into most teacher-student frameworks without model modifications or additional components. Our straightforward IPixMatch method demonstrates consistent performance improvements across various benchmark datasets under different partitioning protocols.
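One way to encode inter-pixel relations is to match pairwise prediction similarities between teacher and student; the sketch below is an illustrative assumption in that spirit, not necessarily the paper's exact loss terms.

```python
import torch
import torch.nn.functional as F

def inter_pixel_loss(student_logits, teacher_logits):
    """Consistency on pairwise similarities rather than per-pixel outputs.

    Both inputs: (B, C, H, W) segmentation logits; in practice use a
    downsampled map, since the similarity matrix is (HW x HW).
    """
    def pairwise_sim(logits):
        p = F.normalize(logits.flatten(2).softmax(dim=1), dim=1)  # (B, C, HW)
        return torch.bmm(p.transpose(1, 2), p)                    # (B, HW, HW)

    with torch.no_grad():
        target = pairwise_sim(teacher_logits)
    return F.mse_loss(pairwise_sim(student_logits), target)
```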
https://arxiv.org/abs/2404.18891
In this paper, we emphasise the critical importance of large-scale datasets for advancing field robotics capabilities, particularly in natural environments. While numerous datasets exist for urban and suburban settings, those tailored to natural environments are scarce. Our recent benchmarks WildPlaces and WildScenes address this gap by providing synchronised image, lidar, and semantic data with accurate 6-DoF pose information in forest-type environments. We highlight the multi-modal nature of this dataset and discuss and demonstrate its utility in various downstream tasks, such as place recognition and 2D and 3D semantic segmentation.
https://arxiv.org/abs/2404.18477
In the era of the Internet of Things (IoT), objects connect through a dynamic network, empowered by technologies like 5G, enabling real-time data sharing. However, smart objects, notably autonomous vehicles, face challenges in critical local computations due to limited resources. Lightweight AI models offer a solution but struggle with diverse data distributions. To address this limitation, we propose a novel Multi-Stream Cellular Test-Time Adaptation (MSC-TTA) setup where models adapt on the fly to a dynamic environment divided into cells. Then, we propose a real-time adaptive student-teacher method that leverages the multiple streams available in each cell to quickly adapt to changing data distributions. We validate our methodology in the context of autonomous vehicles navigating across cells defined based on location and weather conditions. To facilitate future benchmarking, we release a new multi-stream large-scale synthetic semantic segmentation dataset, called DADE, and show that our multi-stream approach outperforms a single-stream baseline. We believe that our work will open research opportunities in the IoT and 5G eras, offering solutions for real-time model adaptation.
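Student-teacher test-time adaptation of this kind typically keeps an exponential-moving-average teacher; a minimal sketch is below, with the per-cell bookkeeping and momentum value as illustrative assumptions.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights track an exponential moving average of the student
    that is being adapted on the streams of the current cell."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```

In an MSC-TTA-like setup one would hold one (student, teacher) pair per cell, e.g. in a dict keyed by cell id, and call ema_update after each adaptation step on that cell's streams.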
https://arxiv.org/abs/2404.17930
Convolutional Neural Networks (CNNs) have become widely adopted for medical image segmentation tasks, demonstrating promising performance. However, the inherent inductive biases in convolutional architectures limit their ability to model long-range dependencies and spatial correlations. While recent transformer-based architectures address these limitations by leveraging self-attention mechanisms to encode long-range dependencies and learn expressive representations, they often struggle to extract low-level features and are highly dependent on data availability. This motivated the development of GLIMS, a data-efficient, attention-guided hybrid volumetric segmentation network. GLIMS utilizes Dilated Feature Aggregator Convolutional Blocks (DACB) to capture local-global feature correlations efficiently. Furthermore, the incorporated Swin Transformer-based bottleneck bridges the local and global features to improve the robustness of the model. Additionally, GLIMS employs an attention-guided segmentation approach through Channel and Spatial-Wise Attention Blocks (CSAB) to localize expressive features for fine-grained border segmentation. Quantitative and qualitative results on glioblastoma and multi-organ CT segmentation tasks demonstrate GLIMS' effectiveness in terms of complexity and accuracy. GLIMS demonstrated outstanding performance on the BraTS2021 and BTCV datasets, surpassing Swin UNETR. Notably, GLIMS achieved this high performance with a significantly reduced number of trainable parameters: GLIMS has 47.16M trainable parameters and 72.30G FLOPs, while Swin UNETR has 61.98M trainable parameters and 394.84G FLOPs. The code is publicly available at this https URL.
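As a rough idea of what channel- and spatial-wise attention gating looks like, here is a hedged 2D sketch (GLIMS itself is volumetric, and the exact CSAB layout is an assumption):

```python
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Hedged sketch of a CSAB-like block: reweight channels, then
    reweight spatial locations."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):                 # x: (B, C, H, W)
        x = x * self.channel_gate(x)      # channel-wise attention
        return x * self.spatial_gate(x)   # spatial-wise attention
```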
https://arxiv.org/abs/2404.17854
Research on camera-and-LiDAR-based semantic object segmentation for autonomous driving has benefited significantly from recent developments in deep learning. In particular, the vision transformer is the recent breakthrough that successfully brought the multi-head attention mechanism to computer vision applications. Therefore, we propose a vision-transformer-based network to carry out camera-LiDAR fusion for semantic segmentation applied to autonomous driving. Our proposal uses a novel progressive-assemble strategy of vision transformers on a double-direction network and then integrates the results in a cross-fusion strategy over the transformer decoder layers. Unlike other works in the literature, our camera-LiDAR fusion transformers (CLFT) have been evaluated in challenging conditions such as rain and low illumination, showing robust performance. The paper reports segmentation results over the vehicle and human classes in different modalities: camera-only, LiDAR-only, and camera-LiDAR fusion. We perform coherent, controlled benchmark experiments of CLFT against other networks that are also designed for semantic segmentation. The experiments evaluate the performance of CLFT from two independent perspectives: multimodal sensor fusion and backbone architectures. The quantitative assessments show that our CLFT networks yield an improvement of up to 10% in challenging dark-wet conditions compared with a Fully-Convolutional-Neural-Network-based (FCN) camera-LiDAR fusion network. Compared to a network with a transformer backbone but single-modality input, the all-around improvement is 5-10%.
https://arxiv.org/abs/2404.17793
Deep learning has revolutionized medical imaging by providing innovative solutions to complex healthcare challenges. Traditional models often struggle to dynamically adjust feature importance, resulting in suboptimal representations, particularly in tasks like semantic segmentation that are crucial for accurate structure delineation. Moreover, their static nature incurs high computational costs. To tackle these issues, we introduce Mamba-Ahnet, a novel integration of a State Space Model (SSM) and an Advanced Hierarchical Network (AHNet) within the MAMBA framework, specifically tailored for semantic segmentation in medical imaging. Mamba-Ahnet combines SSM's feature extraction and comprehension with AHNet's attention mechanisms and image reconstruction, aiming to enhance segmentation accuracy and robustness. By dissecting images into patches and refining feature comprehension through self-attention mechanisms, the approach significantly improves feature resolution. Integration of AHNet into the MAMBA framework further enhances segmentation performance by selectively amplifying informative regions and facilitating the learning of rich hierarchical representations. Evaluation on the Universal Lesion Segmentation dataset demonstrates superior performance compared to state-of-the-art techniques, with notable metrics such as a Dice similarity coefficient of approximately 98% and an Intersection over Union of about 83%. These results underscore the potential of our methodology to enhance diagnostic accuracy, treatment planning, and, ultimately, patient outcomes in clinical practice. By addressing the limitations of traditional models and leveraging the power of deep learning, our approach represents a significant step forward in advancing medical imaging technology.
https://arxiv.org/abs/2404.17235
This paper investigates the use of deep learning approaches to estimate the femur caput-collum-diaphyseal (CCD) angle from X-ray images. The CCD angle is an important measurement in the diagnosis of hip problems, and correct prediction can help in the planning of surgical procedures. Manual measurement of this angle, however, can be time-intensive and vulnerable to inter-observer variability. In this paper, we present a deep-learning algorithm that can reliably estimate the femur CCD angle from X-ray images. To train and test the performance of our model, we employed an X-ray image dataset with associated femur CCD angle measurements. Furthermore, we built a prototype to display the resulting predictions and to allow the user to interact with them. Since this happens in a sterile setting during surgery, we extended our interface so that it can be operated by voice commands alone. Our results show that our deep learning model predicts the femur CCD angle on X-ray images with high accuracy, with a mean absolute error of 4.3 degrees on the left femur and 4.9 degrees on the right femur on the test dataset. Our results suggest that deep learning can provide a more efficient and accurate technique for predicting the femur CCD angle, which might have substantial therapeutic implications for the diagnosis and management of hip problems.
https://arxiv.org/abs/2404.17083
Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.
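Fitting class prototypes with expectation-maximization can be sketched as follows; the cosine E-step and plain (non-stochastic) loop are illustrative assumptions, and the mask-proposal step itself is omitted.

```python
import torch
import torch.nn.functional as F

def em_prototypes(features, prototypes, iters=10):
    """Hedged sketch: alternate soft assignment of mask features to class
    prototypes (E-step) and prototype re-estimation (M-step).

    features:   (N, D) one feature vector per mask proposal.
    prototypes: (K, D) initial class prototypes.
    """
    features = F.normalize(features, dim=1)
    for _ in range(iters):
        prototypes = F.normalize(prototypes, dim=1)
        resp = (features @ prototypes.T).softmax(dim=1)   # E-step: (N, K)
        prototypes = resp.T @ features                    # M-step: (K, D)
    return F.normalize(prototypes, dim=1)
```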
https://arxiv.org/abs/2404.16818
Multi-scale learning is central to semantic segmentation. We visualize the effective receptive field (ERF) of canonical multi-scale representations and point out two risks in learning them: scale inadequacy and field inactivation. A novel multi-scale learner, varying window attention (VWA), is presented to address these issues. VWA leverages local window attention (LWA) and disentangles LWA into the query window and context window, allowing the context's scale to vary so the query can learn representations at multiple scales. However, enlarging the context to large-scale windows (by a ratio R) significantly increases the memory footprint and computation cost (R^2 times larger than LWA). We propose a simple but effective re-scaling strategy to eliminate this extra cost without compromising performance. Consequently, VWA overcomes the receptive limitation of the local window at the same cost as LWA. Furthermore, building on VWA and employing various MLPs, we introduce a multi-scale decoder (MSD), VWFormer, to improve multi-scale representations for semantic segmentation. VWFormer achieves efficiency competitive with the most compute-friendly MSDs, such as FPN and the MLP decoder, while performing much better than any of them. For instance, using nearly half of UPerNet's computation, VWFormer outperforms it by 1.0%-2.5% mIoU on ADE20K. With little extra overhead (~10G FLOPs), Mask2Former armed with VWFormer improves by 1.0%-1.3%.
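The core VWA trick, attending from a local query window to an enlarged but re-scaled context window, can be sketched as follows; using average pooling as the re-scaling operator is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def varying_window_attention(q_win, ctx_win, ratio):
    """Hedged sketch: keys/values come from a context window enlarged by
    `ratio`, pooled back to the query-window size so the attention cost
    stays that of plain local window attention.

    q_win:   (B, N, D) tokens of one query window (N = window_size**2).
    ctx_win: (B, N * ratio**2, D) tokens of the enlarged context window.
    """
    B, N, D = q_win.shape
    side = int((N * ratio**2) ** 0.5)
    ctx = ctx_win.transpose(1, 2).reshape(B, D, side, side)
    ctx = F.adaptive_avg_pool2d(ctx, int(N ** 0.5))   # re-scale the context
    kv = ctx.flatten(2).transpose(1, 2)               # back to (B, N, D)
    attn = (q_win @ kv.transpose(1, 2)) / D ** 0.5
    return attn.softmax(dim=-1) @ kv
```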
https://arxiv.org/abs/2404.16573
In this paper, we address the challenging source-free unsupervised domain adaptation (SFUDA) problem for pinhole-to-panoramic semantic segmentation, given only a model pre-trained on pinhole images (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is non-trivial due to three critical challenges: 1) semantic mismatches from the distinct Field-of-View (FoV) between domains, 2) style discrepancies inherent in the UDA problem, and 3) inevitable distortion of the panoramic images. To tackle these problems, we propose 360SFUDA++, which effectively extracts knowledge from the source pinhole model with only unlabeled panoramic images and transfers the reliable knowledge to the target panoramic domain. Specifically, we first utilize Tangent Projection (TP), as it has less distortion, and meanwhile split the equirectangular projection (ERP) into patches with fixed-FoV projection (FFP) to mimic pinhole images. Both projections prove effective in extracting knowledge from the source model. However, as the distinct projections make it difficult to transfer knowledge directly between domains, we then propose the Reliable Panoramic Prototype Adaptation Module (RP2AM) to transfer knowledge at both the prediction and prototype levels. RP2AM selects confident knowledge and integrates panoramic prototypes for reliable knowledge adaptation. Moreover, we introduce the Cross-projection Dual Attention Module (CDAM), which better aligns the spatial and channel characteristics across projections at the feature level between domains. Both the knowledge extraction and transfer processes are updated synchronously to reach the best performance. Extensive experiments on synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our 360SFUDA++ achieves significantly better performance than prior SFUDA methods.
https://arxiv.org/abs/2404.16501
Despite the remarkable success of deep learning in medical imaging analysis, medical image segmentation remains challenging due to the scarcity of high-quality labeled images for supervision. Further, the significant domain gap between natural and medical images in general, and ultrasound images in particular, hinders fine-tuning models trained on natural images for the task at hand. In this work, we address the performance degradation of segmentation models in low-data regimes and propose a prompt-less segmentation method harnessing the ability of segmentation foundation models to segment abstract shapes. We do so via a novel prompt point generation algorithm that uses coarse semantic segmentation masks as input and a zero-shot promptable foundation model as an optimization target. We demonstrate our method on a findings segmentation task (pathologic anomalies) in ultrasound images. Our method's advantages come to light in experiments spanning varying degrees of the low-data regime on a small-scale musculoskeletal ultrasound image dataset, yielding a larger performance gain as the training set size decreases.
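A simple way to turn coarse masks into prompt points is to pick interior peaks of each predicted region; the sketch below uses distance-transform maxima as an illustrative assumption, not the paper's exact generation algorithm.

```python
import numpy as np
from scipy import ndimage

def prompt_points_from_mask(coarse_mask, max_points=3):
    """Hedged sketch: derive positive prompt points for a promptable
    segmentation foundation model from a coarse binary mask.

    coarse_mask: (H, W) binary mask from the coarse segmentation model.
    Returns a list of (row, col) points, one per connected component.
    """
    labeled, n = ndimage.label(coarse_mask)
    points = []
    for comp in range(1, n + 1):
        # The point deepest inside the component is a robust prompt.
        dist = ndimage.distance_transform_edt(labeled == comp)
        points.append(np.unravel_index(np.argmax(dist), dist.shape))
        if len(points) >= max_points:
            break
    return points
```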
https://arxiv.org/abs/2404.16325
Unsupervised Domain Adaptation (UDA) refers to methods that utilize annotated source domain data and unlabeled target domain data to train a model capable of generalizing to the target domain. Domain discrepancy leads to a significant decrease in the performance of general network models trained on source domain data when applied to the target domain. We introduce a straightforward approach to mitigate the domain discrepancy that requires no additional parameter calculations and integrates seamlessly with self-training-based UDA methods. Through the transfer of the target domain style to the source domain in the latent feature space, the model is trained to prioritize the target domain style during the decision-making process. We tackle the problem at both the image level and the shallow feature map level by transferring the style information from the target domain to the source domain data. As a result, we obtain a model that exhibits superior performance on the target domain. Our method yields remarkable enhancements in state-of-the-art performance for synthetic-to-real UDA tasks. For example, our proposed method attains a UDA performance of 76.93 mIoU on the GTA->Cityscapes benchmark, an improvement of +1.03 percentage points over the previous state-of-the-art result.
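Parameter-free style transfer in latent feature space is classically done by swapping channel-wise statistics (AdaIN-style); whether the paper uses exactly these statistics is an assumption, but the sketch matches the parameter-free, latent-space transfer the abstract describes.

```python
import torch

def transfer_style(source_feat, target_feat, eps=1e-5):
    """Re-style source-domain features with target-domain statistics.

    Both inputs: (B, C, H, W) feature maps from the same network layer.
    """
    mu_s = source_feat.mean(dim=(2, 3), keepdim=True)
    std_s = source_feat.std(dim=(2, 3), keepdim=True) + eps
    mu_t = target_feat.mean(dim=(2, 3), keepdim=True)
    std_t = target_feat.std(dim=(2, 3), keepdim=True) + eps
    # Normalize away source style, then apply target style.
    return (source_feat - mu_s) / std_s * std_t + mu_t
```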
https://arxiv.org/abs/2404.16301
As one of the emerging challenges in Automated Machine Learning, Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs). An important application of HW-NAS is real-time semantic segmentation, which plays a pivotal role in autonomous driving scenarios. HW-NAS for real-time semantic segmentation inherently needs to balance multiple optimization objectives, including model accuracy, inference speed, and hardware-specific considerations. Despite its importance, benchmarks have yet to be developed to frame such a challenging task as multi-objective optimization. To bridge the gap, we introduce a tailored pipeline to transform the task of HW-NAS for real-time semantic segmentation into standard MOPs. Building upon this pipeline, we present a benchmark test suite, CitySeg/MOP, comprising fifteen MOPs derived from the Cityscapes dataset. The CitySeg/MOP test suite is integrated into the EvoXBench platform to provide seamless interfaces with various programming languages (e.g., Python and MATLAB) for instant fitness evaluations. We comprehensively assessed the CitySeg/MOP test suite with various multi-objective evolutionary algorithms, showcasing its versatility and practicality. Source code is available at this https URL.
https://arxiv.org/abs/2404.16266
Patellofemoral joint (PFJ) issues affect one in four people, with 20% experiencing chronic knee pain despite treatment. Poor outcomes and pain after knee replacement surgery are often linked to patellar mal-tracking. Traditional imaging methods like CT and MRI face challenges, including cost and metal artefacts, and there is currently no ideal way to observe joint motion free of issues such as soft tissue artefacts or radiation exposure. A new system to monitor joint motion could significantly improve understanding of PFJ dynamics, aiding better patient care and outcomes. Combining 2D ultrasound with motion tracking for 3D reconstruction of the joint, using semantic segmentation and position registration, can be a solution. However, the need for expensive external infrastructure to estimate the trajectories of the scanner remains the main limitation to implementing 3D bone reconstruction from handheld ultrasound scanning clinically. We propose Visual-Inertial Odometry (VIO) and deep-learning-based inertial-only odometry methods as alternatives to motion capture for tracking a handheld ultrasound scanner. The 3D reconstructions generated by these methods have demonstrated potential for assessing the PFJ and for further measurements from free-hand ultrasound scans. The results show that the VIO method performs as well as the motion capture method, with average reconstruction errors of 1.25 mm and 1.21 mm, respectively. The VIO method is the first infrastructure-free method for 3D reconstruction of bone from wireless handheld ultrasound scanning, with accuracy comparable to methods that require external infrastructure.
https://arxiv.org/abs/2404.15847