Existing parameter-efficient fine-tuning (PEFT) methods have achieved significant success in adapting vision transformers (ViTs) by improving parameter efficiency. However, enhancing inference efficiency during adaptation remains underexplored. This limits the broader application of pre-trained ViT models, especially when the model is computationally intensive. In this paper, we propose Dynamic Tuning (DyT), a novel approach to improve both parameter and inference efficiency for ViT adaptation. Specifically, besides using lightweight adapter modules, we propose a token dispatcher to distinguish informative tokens from less important ones, allowing the latter to dynamically skip the original block and thereby reducing redundant computation during inference. Additionally, we explore multiple design variants to find the best practice of DyT. Finally, inspired by the mixture-of-experts (MoE) mechanism, we introduce an enhanced adapter to further boost adaptation performance. We validate DyT across various tasks, including image/video recognition and semantic segmentation. For instance, DyT achieves comparable or even superior performance compared to existing PEFT methods while using only 71%-85% of their FLOPs on the VTAB-1K benchmark.
https://arxiv.org/abs/2403.11808
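As a rough sketch of DyT's token-skipping idea (a hypothetical simplification: in the paper the importance scores come from a learned token dispatcher, while here they are supplied as inputs and the transformer block is an arbitrary function):

```python
import numpy as np

def dynamic_block(tokens, scores, block_fn, keep_ratio=0.5):
    """Route only the most informative tokens through the block.

    tokens: (N, D) token array; scores: (N,) importance scores.
    Tokens outside the top `keep_ratio` fraction skip the block and
    pass through unchanged, saving the block's FLOPs on them.
    """
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    keep_idx = np.argsort(-scores)[:n_keep]   # most informative first
    out = tokens.astype(float).copy()         # skipped tokens: identity
    out[keep_idx] = block_fn(tokens[keep_idx])
    return out
```

With `keep_ratio=0.5`, half of the tokens bypass `block_fn` entirely, which is where the FLOPs savings would come from.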
Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at this https URL.
https://arxiv.org/abs/2403.11735
Test-Time Training (TTT) proposes to adapt a pre-trained network to changing data distributions on-the-fly. In this work, we propose the first TTT method for 3D semantic segmentation, TTT-KD, which models Knowledge Distillation (KD) from foundation models (e.g. DINOv2) as a self-supervised objective for adaptation to distribution shifts at test-time. Given access to paired image-pointcloud (2D-3D) data, we first optimize a 3D segmentation backbone for the main task of semantic segmentation using the pointclouds and for the task of 2D $\to$ 3D KD by using an off-the-shelf 2D pre-trained foundation model. At test-time, our TTT-KD updates the 3D segmentation backbone for each test sample by using the self-supervised task of knowledge distillation before performing the final prediction. Extensive evaluations on multiple indoor and outdoor 3D segmentation benchmarks show the utility of TTT-KD, as it improves performance for both in-distribution (ID) and out-of-distribution (OOD) test datasets. We achieve a gain of up to 13% mIoU (7% on average) when the train and test distributions are similar, and up to 45% (20% on average) when adapting to OOD test samples.
https://arxiv.org/abs/2403.11691
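The test-time distillation update can be illustrated on a toy linear "backbone" (an assumption for brevity; the actual method takes gradient steps on a 3D segmentation network against frozen 2D foundation-model features):

```python
import numpy as np

def ttt_kd_step(W, x, teacher_feat, lr=0.1):
    """One test-time adaptation step on a toy linear model W.

    Minimizes 0.5 * ||W @ x - teacher_feat||^2, standing in for the
    2D->3D distillation objective; TTT-KD performs such updates per
    test sample before making the final prediction.
    """
    residual = W @ x - teacher_feat
    grad = np.outer(residual, x)  # gradient of the squared error w.r.t. W
    return W - lr * grad
```

A single step is guaranteed to lower the distillation loss for a small enough learning rate, mirroring the per-sample adaptation loop.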
Multi-target domain adaptation (MTDA) for semantic segmentation poses a significant challenge, as it involves multiple target domains with varying distributions. The goal of MTDA is to minimize the domain discrepancies among a single source and multi-target domains, aiming to train a single model that excels across all target domains. Previous MTDA approaches typically employ multiple teacher architectures, where each teacher specializes in one target domain to simplify the task. However, these architectures hinder the student model from fully assimilating comprehensive knowledge from all target-specific teachers and escalate training costs with increasing target domains. In this paper, we propose an ouroboric domain bridging (OurDB) framework, offering an efficient solution to the MTDA problem using a single teacher architecture. This framework dynamically cycles through multiple target domains, aligning each domain individually to restrain the biased alignment problem, and utilizes Fisher information to minimize the forgetting of knowledge from previous target domains. We also propose a context-guided class-wise mixup (CGMix) that leverages contextual information tailored to diverse target contexts in MTDA. Experimental evaluations conducted on four urban driving datasets (i.e., GTA5, Cityscapes, IDD, and Mapillary) demonstrate the superiority of our method over existing state-of-the-art approaches.
https://arxiv.org/abs/2403.11582
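The Fisher-based anti-forgetting term can be sketched as a standard EWC-style penalty (the paper's exact formulation may differ; `fisher` would be estimated from data of earlier target domains):

```python
import numpy as np

def fisher_penalty(params, anchor_params, fisher, lam=1.0):
    """Fisher-weighted quadratic penalty that discourages the model
    from drifting away from weights learned on previous target domains.
    Parameters with high Fisher information (important for old domains)
    are penalized more strongly for changing."""
    return lam * float(np.sum(fisher * (params - anchor_params) ** 2))
```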
Perception plays a crucial role in various robot applications. However, existing well-annotated datasets are biased towards autonomous driving scenarios, while unlabelled SLAM datasets are quickly over-fitted, and often lack environment and domain variations. To expand the frontier of these fields, we introduce a comprehensive dataset named MCD (Multi-Campus Dataset), featuring a wide range of sensing modalities, high-accuracy ground truth, and diverse challenging environments across three Eurasian university campuses. MCD comprises both CCS (Classical Cylindrical Spinning) and NRE (Non-Repetitive Epicyclic) lidars, high-quality IMUs (Inertial Measurement Units), cameras, and UWB (Ultra-WideBand) sensors. Furthermore, in a pioneering effort, we introduce semantic annotations of 29 classes over 59k sparse NRE lidar scans across three domains, thus providing a novel challenge to existing semantic segmentation research upon this largely unexplored lidar modality. Finally, we propose, for the first time to the best of our knowledge, continuous-time ground truth based on optimization-based registration of lidar-inertial data on large survey-grade prior maps, which are also publicly released, each several times the size of existing ones. We conduct a rigorous evaluation of numerous state-of-the-art algorithms on MCD, report their performance, and highlight the challenges awaiting solutions from the research community.
https://arxiv.org/abs/2403.11496
Semantic segmentation is a crucial task in computer vision, where each pixel in an image is classified into a category. However, traditional methods face significant challenges, including the need for pixel-level annotations and extensive training. Furthermore, because supervised learning uses a limited set of predefined categories, models typically struggle with rare classes and cannot recognize new ones. Unsupervised and open-vocabulary segmentation, proposed to tackle these issues, faces challenges of its own, including the inability to assign specific class labels to clusters and the necessity of user-provided text queries for guidance. In this context, we propose a novel approach, TAG, which achieves Training-, Annotation-, and Guidance-free open-vocabulary semantic segmentation. TAG utilizes pre-trained models such as CLIP and DINO to segment images into meaningful categories without additional training or dense annotations. It retrieves class labels from an external database, providing the flexibility to adapt to new scenarios. Our TAG achieves state-of-the-art results on PascalVOC, PascalContext and ADE20K for open-vocabulary segmentation without given class names, e.g., an improvement of +15.3 mIoU on PascalVOC. All code and data will be released at this https URL.
https://arxiv.org/abs/2403.11197
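The label-retrieval step can be sketched as nearest-neighbor search in a joint embedding space (a minimal version; `db_embs` and `db_labels` stand in for TAG's external database of class-name embeddings, e.g. CLIP text embeddings, and are hypothetical names):

```python
import numpy as np

def retrieve_label(cluster_emb, db_embs, db_labels):
    """Assign a segment the database class whose embedding has the
    highest cosine similarity with the segment's image embedding."""
    a = cluster_emb / np.linalg.norm(cluster_emb)
    b = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    return db_labels[int(np.argmax(b @ a))]
```

Because the class list lives in an external database rather than in the model, new categories can be supported by simply extending `db_labels`.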
Semantic segmentation is essential in computer vision for various applications, yet traditional approaches face significant challenges, including the high cost of annotation and the extensive training required for supervised learning. Additionally, due to the limited predefined categories in supervised learning, models typically struggle with infrequent classes and are unable to predict novel classes. To address these limitations, we propose MaskDiffusion, an innovative approach that leverages pretrained frozen Stable Diffusion to achieve open-vocabulary semantic segmentation without the need for additional training or annotation, leading to improved performance compared to similar methods. We also demonstrate the superior performance of MaskDiffusion in handling open vocabularies, including fine-grained and proper-noun-based categories, thus expanding the scope of segmentation applications. Overall, our MaskDiffusion shows significant qualitative and quantitative improvements over other comparable unsupervised segmentation methods, e.g., on the Potsdam dataset (+10.5 mIoU compared to GEM) and COCO-Stuff (+14.8 mIoU compared to DiffSeg). All code and data will be released at this https URL.
https://arxiv.org/abs/2403.11194
Recently, One-stage Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has gained increasing interest due to simplification over its cumbersome multi-stage counterpart. Limited by the inherent ambiguity of Class Activation Map (CAM), we observe that one-stage pipelines often encounter confirmation bias caused by incorrect CAM pseudo-labels, impairing their final segmentation performance. Although recent works discard many unreliable pseudo-labels to implicitly alleviate this issue, they fail to exploit sufficient supervision for their models. To this end, we propose a dual student framework with trustworthy progressive learning (DuPL). Specifically, we propose a dual student network with a discrepancy loss to yield diverse CAMs for each sub-net. The two sub-nets generate supervision for each other, mitigating the confirmation bias caused by learning their own incorrect pseudo-labels. In this process, we progressively introduce more trustworthy pseudo-labels to be involved in the supervision through dynamic threshold adjustment with an adaptive noise filtering strategy. Moreover, we believe that every pixel, even discarded from supervision due to its unreliability, is important for WSSS. Thus, we develop consistency regularization on these discarded regions, providing supervision of every pixel. Experiment results demonstrate the superiority of the proposed DuPL over the recent state-of-the-art alternatives on PASCAL VOC 2012 and MS COCO datasets. Code is available at this https URL.
https://arxiv.org/abs/2403.11184
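The progressive pseudo-label admission can be sketched as a decaying confidence threshold (the linear schedule here is an assumption; DuPL couples dynamic thresholding with an adaptive noise filter):

```python
import numpy as np

def trustworthy_mask(probs, step, total_steps, t_start=0.9, t_end=0.5):
    """Keep pixels whose max class probability clears a confidence
    threshold that decays from t_start to t_end over training, so that
    more pseudo-labels are progressively admitted into supervision."""
    t = t_start + (t_end - t_start) * (step / total_steps)
    return probs.max(axis=-1) >= t
```

Pixels rejected by this mask would still receive the consistency regularization described in the abstract, so no pixel is left entirely unsupervised.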
Few-shot segmentation models excel in metal defect detection due to their rapid generalization ability to new classes and pixel-level segmentation, rendering them ideal for addressing data scarcity issues and achieving refined object delineation in industrial applications. Existing works neglect the \textit{Intra-Class Differences}, inherent in metal surface defect data, which hinders the model from learning sufficient knowledge from the support set to guide the query set segmentation. Specifically, it can be categorized into two types: the \textit{Semantic Difference} induced by internal factors in metal samples and the \textit{Distortion Difference} caused by external factors of surroundings. To address these differences, we introduce a \textbf{L}ocal d\textbf{E}scriptor based \textbf{R}easoning and \textbf{E}xcitation \textbf{Net}work (\textbf{LERENet}) to learn the two-view guidance, i.e., local and global information from the graph and feature space, and fuse them to segment precisely. Since the relation structure of local features embedded in graph space will help to eliminate \textit{Semantic Difference}, we employ Multi-Prototype Reasoning (MPR) module, extracting local descriptors based prototypes and analyzing local-view feature relevance in support-query pairs. Besides, due to the global information that will assist in countering the \textit{Distortion Difference} in observations, we utilize Multi-Prototype Excitation (MPE) module to capture the global-view relations in support-query pairs. Finally, we employ an Information Fusion Module (IFM) to fuse learned prototypes in local and global views to generate pixel-level masks. Our comprehensive experiments on defect datasets demonstrate that it outperforms existing benchmarks, establishing a new state-of-the-art.
https://arxiv.org/abs/2403.11122
Remote sensing image super-resolution (SR) is a crucial task to restore high-resolution (HR) images from low-resolution (LR) observations. Recently, the Denoising Diffusion Probabilistic Model (DDPM) has shown promising performance in image reconstruction by overcoming problems inherent in generative models, such as over-smoothing and mode collapse. However, the high-frequency details generated by DDPM often suffer from misalignment with HR images due to the model's tendency to overlook long-range semantic contexts. This is attributed to the widely used U-Net decoder in the conditional noise predictor, which tends to overemphasize local information, leading to the generation of noises with significant variances during the prediction process. To address these issues, an adaptive semantic-enhanced DDPM (ASDDPM) is proposed to enhance the detail-preserving capability of the DDPM by incorporating low-frequency semantic information provided by the Transformer. Specifically, a novel adaptive diffusion Transformer decoder (ADTD) is developed to bridge the semantic gap between the encoder and decoder by regulating the noise prediction with the global contextual relationships and long-range dependencies in the diffusion process. Additionally, a residual feature fusion strategy establishes information exchange between the two decoders at multiple levels. As a result, the predicted noise generated by our approach closely approximates that of the real noise distribution. Extensive experiments on two SR and two semantic segmentation datasets confirm the superior performance of the proposed ASDDPM in both SR and the subsequent downstream applications. The source code will be available at this https URL.
https://arxiv.org/abs/2403.11078
Crashes and delays at Railroad Highway Grade Crossings (RHGC), where highways and railroads intersect, pose significant safety concerns for the U.S. Federal Railroad Administration (FRA). Despite the critical importance of addressing accidents and traffic delays at highway-railroad intersections, there is a notable dearth of research on practical solutions for managing these issues. In response to this gap in the literature, our study introduces an intelligent system that leverages machine learning and computer vision techniques to enhance safety at RHGCs. This research proposes a Non-Maximum Suppression (NMS)-based ensemble model that integrates a variety of YOLO variants, specifically YOLOv5S, YOLOv5M, and YOLOv5L, for grade-crossing object detection, and utilizes segmentation techniques from the UNet architecture for detecting approaching rail at a grade crossing. Both methods are implemented on a Raspberry Pi. Moreover, the strategy employs high-definition cameras installed at the RHGC. This framework enables the system to monitor objects within the Region of Interest (ROI) at crossings, detect the approach of trains, and clear the crossing area before a train arrives. In terms of accuracy, precision, recall, and Intersection over Union (IoU), the proposed NMS-based object detection ensemble model achieved 96% precision, and the UNet segmentation model obtained a 98% IoU value. This automated railroad grade crossing system powered by artificial intelligence represents a promising solution for enhancing safety at highway-railroad intersections.
https://arxiv.org/abs/2403.11060
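The NMS-based ensembling step can be sketched as pooling boxes from all detectors and suppressing duplicates once (a generic implementation, not the authors' code):

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes:
    repeatedly keep the highest-scoring box and drop boxes that
    overlap it above the IoU threshold."""
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thr]
    return keep

def ensemble_detections(model_outputs, iou_thr=0.5):
    """Pool (boxes, scores) pairs from several detectors (e.g. the
    three YOLOv5 variants), then run NMS once over the union."""
    boxes = np.vstack([b for b, _ in model_outputs])
    scores = np.concatenate([s for _, s in model_outputs])
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```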
Cytology image segmentation is quite challenging due to the complex cellular structures and multiple overlapping regions involved. On the other hand, supervised machine learning techniques need a large amount of annotated data, which is costly. In recent years, late fusion techniques have shown promising performance in the field of image classification. In this paper, we explore a fuzzy-based late fusion technique for cytology image segmentation. This fusion rule integrates three traditional semantic segmentation models: UNet, SegNet, and PSPNet. The technique is applied to two cytology image datasets, i.e., the cervical cytology (HErlev) and breast cytology (JUCYT-v1) image datasets. We achieved maximum MeanIoU scores of 84.27% and 83.79% on the HErlev and JUCYT-v1 datasets, respectively, with the proposed late fusion technique, which are better than those of traditional fusion rules such as average probability, geometric mean, and Borda count. The code of the proposed model is available on GitHub.
https://arxiv.org/abs/2403.10884
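A fuzzy late-fusion rule can be sketched with a minimum T-norm over the three models' class-probability maps (one plausible choice of fuzzy aggregation; the paper's exact membership functions are not reproduced here):

```python
import numpy as np

def fuzzy_fuse(prob_maps):
    """Fuse per-model class-probability maps of shape (H, W, C) with a
    fuzzy minimum T-norm, then renormalize over the class axis."""
    fused = np.minimum.reduce(prob_maps)
    s = fused.sum(axis=-1, keepdims=True)
    s[s == 0] = 1.0  # guard against all-zero memberships
    return fused / s

def fused_prediction(prob_maps):
    """Per-pixel class decision from the fused membership map."""
    return fuzzy_fuse(prob_maps).argmax(axis=-1)
```

The minimum T-norm only lets a class win if every model assigns it reasonable membership, a conservative alternative to the average-probability rule the paper compares against.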
Training and validating models for semantic segmentation require datasets with pixel-wise annotations, which are notoriously labor-intensive. Although useful priors such as foundation models or crowdsourced datasets are available, they are error-prone. We hence propose an effective framework of active label correction (ALC) based on a correction-query design that rectifies pseudo labels of pixels and which, according to our theoretical analysis and user study, is more annotator-friendly than the standard query asking to classify a pixel directly. Specifically, leveraging foundation models that provide useful zero-shot predictions on pseudo labels and superpixels, our method comprises two key techniques: (i) an annotator-friendly design of the correction query with the pseudo labels, and (ii) an acquisition function that looks ahead to label expansions based on the superpixels. Experimental results on the PASCAL, Cityscapes, and Kvasir-SEG datasets demonstrate the effectiveness of our ALC framework, outperforming prior methods for active semantic segmentation and label correction. Notably, using our method, we obtained a revised version of PASCAL by rectifying errors in 2.6 million pixels of the dataset.
https://arxiv.org/abs/2403.10820
This research paper presents an innovative multi-task learning framework that allows concurrent depth estimation and semantic segmentation using a single camera. The proposed approach is based on a shared encoder-decoder architecture, which integrates various techniques to improve the accuracy of the depth estimation and semantic segmentation tasks without compromising computational efficiency. Additionally, the paper incorporates an adversarial training component, employing a Wasserstein GAN framework with a critic network, to refine the model's predictions. The framework is thoroughly evaluated on two datasets - the outdoor Cityscapes dataset and the indoor NYU Depth V2 dataset - and it outperforms existing state-of-the-art methods in both segmentation and depth estimation tasks. We also conducted ablation studies to analyze the contributions of different components - including pre-training strategies, the inclusion of critics, the use of logarithmic depth scaling, and advanced image augmentations - to provide a better understanding of the proposed framework. The accompanying source code is accessible at \url{this https URL}.
https://arxiv.org/abs/2403.10662
Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
https://arxiv.org/abs/2403.10516
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require many trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features, but also finds the proper location for placing the multi-head self-attention module. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of light-weight convolution layers and memory-efficient self-attention layers across branches at different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{this https URL}.
https://arxiv.org/abs/2403.10413
In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our focus is on mining and capturing cross-task relationships. Existing solutions typically rely on learning global image representations for global cross-task image matching, imposing constraints that, unfortunately, sacrifice the finer structures within the images. Attempting local matching as a remedy faces hurdles due to the lack of precise region supervision, making local alignment a challenging endeavor. The introduction of Segment Anything Model (SAM) sheds light on addressing local alignment challenges by providing free and high-quality solutions for region detection. Leveraging SAM-detected regions, the subsequent challenge lies in aligning the representations within these regions. Diverging from conventional methods that directly learn a monolithic image representation, our proposal involves modeling region-wise representations using Gaussian Distributions. Aligning these distributions between corresponding regions from different tasks imparts higher flexibility and capacity to capture intra-region structures, accommodating a broader range of tasks. This innovative approach significantly enhances our ability to effectively capture cross-task relationships, resulting in improved overall performance in partially supervised multi-task dense prediction scenarios. Extensive experiments conducted on two widely used benchmarks underscore the superior effectiveness of our proposed method, showcasing state-of-the-art performance even when compared to fully supervised methods.
https://arxiv.org/abs/2403.10252
Surgical instrument segmentation in laparoscopy is essential for computer-assisted surgical systems. Despite the Deep Learning progress in recent years, the dynamic setting of laparoscopic surgery still presents challenges for precise segmentation. The nnU-Net framework excelled in semantic segmentation analyzing single frames without temporal information. The framework's ease of use, including its ability to be automatically configured, and its low expertise requirements, have made it a popular base framework for comparisons. Optical flow (OF) is a tool commonly used in video tasks to estimate motion and represent it in a single frame, containing temporal information. This work seeks to employ OF maps as an additional input to the nnU-Net architecture to improve its performance in the surgical instrument segmentation task, taking advantage of the fact that instruments are the main moving objects in the surgical field. With this new input, the temporal component would be indirectly added without modifying the architecture. Using CholecSeg8k dataset, three different representations of movement were estimated and used as new inputs, comparing them with a baseline model. Results showed that the use of OF maps improves the detection of classes with high movement, even when these are scarce in the dataset. To further improve performance, future work may focus on implementing other OF-preserving augmentations.
https://arxiv.org/abs/2403.10216
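The core idea above — feeding motion in as extra input channels so the architecture itself stays untouched — can be sketched as follows. The representation names (`uv`, `mag_ang`) are hypothetical illustrations, not the paper's exact three movement representations.

```python
import numpy as np

def flow_to_channels(flow, rep="mag_ang"):
    """Convert a dense optical-flow field (H, W, 2) into extra input channels.

    Two illustrative movement representations: the raw u/v components, or a
    magnitude + normalised-angle pair.
    """
    u, v = flow[..., 0], flow[..., 1]
    if rep == "uv":
        return np.stack([u, v], axis=0)            # (2, H, W)
    mag = np.sqrt(u ** 2 + v ** 2)
    ang = np.arctan2(v, u) / np.pi                 # normalised to [-1, 1]
    return np.stack([mag, ang], axis=0)

def build_network_input(rgb, flow, rep="mag_ang"):
    """Concatenate RGB (3, H, W) with flow channels -> (3 + k, H, W).

    nnU-Net then treats the result as a multi-channel input; only the number
    of input channels changes, not the architecture itself.
    """
    return np.concatenate([rgb, flow_to_channels(flow, rep)], axis=0)
```

The same channel-stacking trick works for any frame-aligned auxiliary signal, which is why the temporal component can be added "indirectly" here.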
Landslides are one of the most destructive natural disasters in the world, posing a serious threat to human life and safety. The development of foundation models has provided a new research paradigm for large-scale landslide detection. The Segment Anything Model (SAM) has garnered widespread attention in the field of image segmentation. However, our experiments found that SAM performed poorly on the task of landslide segmentation. We propose TransLandSeg, a transfer learning approach for landslide semantic segmentation based on a vision foundation model (VFM). TransLandSeg outperforms traditional semantic segmentation models on both the Landslide4Sense dataset and the Bijie landslide dataset. Our proposed adaptive transfer learning (ATL) architecture enables the powerful segmentation capability of SAM to be transferred to landslide detection by training only 1.3% of SAM's parameters, which greatly improves training efficiency. Finally, we conducted ablation experiments on models with different ATL structures and concluded that the deployment location and residual connection of ATL play an important role in improving TransLandSeg's accuracy.
https://arxiv.org/abs/2403.10127
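A minimal sketch of a residual bottleneck adapter in the spirit of ATL (the class name, shapes, and initialization are our assumptions, not the paper's implementation): the pre-trained SAM blocks stay frozen and only the small adapter weights are trained, which is how the trainable-parameter count stays tiny.

```python
import numpy as np

class Adapter:
    """Bottleneck adapter with a residual connection (forward pass only).

    The frozen block's output x (dim d) passes through a small down-project /
    ReLU / up-project path; the residual connection means a zero-initialized
    up-projection leaves the pre-trained representation untouched at the
    start of training.
    """
    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(scale=0.02, size=(dim, bottleneck))
        self.w_up = np.zeros((bottleneck, dim))    # zero-init: identity at start

    def __call__(self, x):                         # x: (tokens, dim)
        return x + np.maximum(x @ self.w_down, 0.0) @ self.w_up

    def num_params(self):
        # Only these weights would be trainable; the backbone stays frozen.
        return self.w_down.size + self.w_up.size
```

With a large frozen backbone, `num_params` of all adapters divided by the backbone's parameter count gives the trainable fraction; choosing a small enough bottleneck is what brings it down to the order of a percent, as the abstract reports for ATL.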
Unsupervised domain adaptation (UDA) is vital for alleviating the workload of labeling 3D point cloud data and mitigating the absence of labels when facing a newly defined domain. Various methods of utilizing images to enhance the performance of cross-domain 3D segmentation have recently emerged. However, pseudo labels, which are generated by models trained on the source domain and provide additional supervisory signals for the unseen domain, are inadequate for 3D segmentation due to their inherent noisiness, which restricts the accuracy of neural networks. With the advent of 2D visual foundation models (VFMs) and their abundant prior knowledge, we propose a novel pipeline, VFMSeg, that further enhances the cross-modal unsupervised domain adaptation framework by leveraging these models. In this work, we study how to harness the prior knowledge learned by VFMs to produce more accurate labels for unlabeled target domains and improve overall performance. We first utilize a multi-modal VFM, pre-trained on large-scale image-text pairs, to provide supervisory labels (VFM-PL) for images and point clouds from the target domain. Then, another VFM trained on fine-grained 2D masks is adopted to guide the generation of semantically augmented images and point clouds, mixing data from the source and target domains along view frustums (FrustumMixing) to enhance the performance of neural networks. Finally, we merge class-wise predictions across modalities to produce more accurate annotations for unlabeled target domains. Our method is evaluated on various autonomous driving datasets and the results demonstrate a significant improvement on the 3D segmentation task.
https://arxiv.org/abs/2403.10001
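The FrustumMixing idea — splicing a source scan and a target scan along a viewing wedge — might be sketched as follows. The azimuth-based wedge test is our simplification of how a view frustum could be defined; the function name and signature are illustrative, not the paper's API.

```python
import numpy as np

def frustum_mix(src_pts, src_lbl, tgt_pts, tgt_lbl, ang_lo, ang_hi):
    """Mix two point clouds along a view frustum (a sketch of the idea).

    Points whose azimuth falls inside [ang_lo, ang_hi) are taken from the
    source scan and the rest from the target scan, producing one mixed
    scene and the matching mixed labels.
    """
    def in_wedge(pts):
        az = np.arctan2(pts[:, 1], pts[:, 0])   # azimuth angle in [-pi, pi]
        return (az >= ang_lo) & (az < ang_hi)

    s = in_wedge(src_pts)                       # source points inside the wedge
    t = ~in_wedge(tgt_pts)                      # target points outside it
    pts = np.concatenate([src_pts[s], tgt_pts[t]], axis=0)
    lbl = np.concatenate([src_lbl[s], tgt_lbl[t]], axis=0)
    return pts, lbl
```

Because labels travel with their points, the mixed scene stays consistently annotated, letting the network see source and target geometry side by side within a single training sample.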