Numerous studies have demonstrated the strong performance of Vision Transformer (ViT)-based methods across various computer vision tasks. However, ViT models often struggle to effectively capture high-frequency components in images, which are crucial for detecting small targets and preserving edge details, especially in complex scenarios. This limitation is particularly challenging in colon polyp segmentation, where polyps exhibit significant variability in structure, texture, and shape. High-frequency information, such as boundary details, is essential for achieving precise semantic segmentation in this context. To address these challenges, we propose HiFiSeg, a novel network for colon polyp segmentation that enhances high-frequency information processing through a global-local vision transformer framework. HiFiSeg leverages the pyramid vision transformer (PVT) as its encoder and introduces two key modules: the global-local interaction module (GLIM) and the selective aggregation module (SAM). GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine-grained features. SAM selectively integrates boundary details from low-level features with semantic information from high-level features, significantly improving the model's ability to accurately detect and segment polyps. Extensive experiments on five widely recognized benchmark datasets demonstrate the effectiveness of HiFiSeg for polyp segmentation. Notably, the mDice scores on the challenging CVC-ColonDB and ETIS datasets reached 0.826 and 0.822, respectively, underscoring the superior performance of HiFiSeg in handling the specific complexities of this task.
https://arxiv.org/abs/2410.02528
The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few-shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion-based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re-evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework and effectively utilizing the pre-training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.
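To make the KV fusion idea concrete, the sketch below shows one plausible reading of it, assuming the query and support images are already encoded into token sequences of equal dimension; the projection matrices and shapes are illustrative stand-ins, not the authors' actual layers. Queries come only from the query image, while keys and values are pooled from both images inside the self-attention step.

```python
import torch
import torch.nn.functional as F

def kv_fusion_attention(q_tokens, s_tokens, w_q, w_k, w_v):
    """q_tokens: (N_q, d) query-image tokens; s_tokens: (N_s, d) support-image tokens.
    w_q, w_k, w_v: (d, d) projections of a (frozen) self-attention layer."""
    d = q_tokens.shape[-1]
    q = q_tokens @ w_q                               # queries come from the query image only
    kv_src = torch.cat([q_tokens, s_tokens], dim=0)  # fuse support tokens into the key/value pool
    k, v = kv_src @ w_k, kv_src @ w_v
    attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)     # (N_q, N_q + N_s)
    return attn @ v                                  # query tokens now mix in support-image content

# toy usage with random tokens and projections
torch.manual_seed(0)
d = 64
q_tokens, s_tokens = torch.randn(196, d), torch.randn(196, d)
w_q, w_k, w_v = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
print(kv_fusion_attention(q_tokens, s_tokens, w_q, w_k, w_v).shape)  # torch.Size([196, 64])
```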
https://arxiv.org/abs/2410.02369
3D scene understanding is crucial for facilitating seamless interaction between digital devices and the physical world. Real-time capturing and processing of the 3D scene are essential for achieving this seamless integration. While existing approaches typically separate acquisition and processing for each frame, the advent of resolution-scalable 3D sensors offers an opportunity to overcome this paradigm and fully leverage the otherwise wasted acquisition time to initiate processing. In this study, we introduce VX-S3DIS, a novel point cloud dataset accurately simulating the behavior of a resolution-scalable 3D sensor. Additionally, we present RESSCAL3D++, an important improvement over our prior work, RESSCAL3D, by incorporating an update module and processing strategy. By applying our method to the new dataset, we practically demonstrate the potential of joint acquisition and semantic segmentation of 3D point clouds. Our resolution-scalable approach significantly reduces scalability costs from 2% to just 0.2% in mIoU while achieving impressive speed-ups of 15.6 to 63.9% compared to the non-scalable baseline. Furthermore, our scalable approach enables early predictions, with the first one occurring after only 7% of the total inference time of the baseline. The new VX-S3DIS dataset is available at this https URL.
https://arxiv.org/abs/2410.02323
Recently, the integration of the local modeling capabilities of Convolutional Neural Networks (CNNs) with the global dependency strengths of Transformers has created a sensation in the semantic segmentation community. However, substantial computational workloads and high hardware memory demands remain major obstacles to their further application in real-time scenarios. In this work, we propose a lightweight multiple-information interaction network for real-time semantic segmentation, called LMIINet, which effectively combines CNNs and Transformers while reducing redundant computations and memory footprint. It features Lightweight Feature Interaction Bottleneck (LFIB) modules comprising efficient convolutions that enhance context integration. Additionally, improvements are made to the Flatten Transformer by enhancing local and global feature interaction to capture detailed semantic information. The incorporation of a combination coefficient learning scheme in both LFIB and Transformer blocks facilitates improved feature interaction. Extensive experiments demonstrate that LMIINet excels in balancing accuracy and efficiency. With only 0.72M parameters and 11.74G FLOPs, LMIINet achieves 72.0% mIoU at 100 FPS on the Cityscapes test set and 69.94% mIoU at 160 FPS on the CamVid test dataset using a single RTX2080Ti GPU.
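As a rough illustration of the combination-coefficient learning scheme, the module below blends a cheap local (depthwise convolution) branch and a stand-in global branch with two learned scalars; the block structure is an assumption for illustration, not LMIINet's actual LFIB or Transformer block.

```python
import torch
import torch.nn as nn

class CoeffFusion(nn.Module):
    """Two learnable scalars weight a local branch and a global branch before fusion."""
    def __init__(self, channels):
        super().__init__()
        self.local_branch = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # cheap local context
        self.global_branch = nn.Conv2d(channels, channels, 1)                             # stand-in for a global mixer
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learned combination coefficients
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, x):
        return self.alpha * self.local_branch(x) + self.beta * self.global_branch(x)

print(CoeffFusion(32)(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```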
https://arxiv.org/abs/2410.02224
Melanoma segmentation in Whole Slide Images (WSIs) is useful for prognosis and the measurement of crucial prognostic factors such as Breslow depth and primary invasive tumor size. In this paper, we present a novel approach that uses the Segment Anything Model (SAM) for automatic melanoma segmentation in microscopy slide images. Our method employs an initial semantic segmentation model to generate preliminary segmentation masks that are then used to prompt SAM. We design a dynamic prompting strategy that uses a combination of centroid and grid prompts to achieve optimal coverage of the super high-resolution slide images while maintaining the quality of generated prompts. To optimize for invasive melanoma segmentation, we further refine the prompt generation process by implementing in-situ melanoma detection and low-confidence region filtering. We select Segformer as the initial segmentation model and EfficientSAM as the segment anything model for parameter-efficient fine-tuning. Our experimental results demonstrate that this approach not only surpasses other state-of-the-art melanoma segmentation methods but also significantly outperforms the baseline Segformer by 9.1% in terms of IoU.
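A hedged sketch of how centroid and grid prompts could be derived from a preliminary segmentation mask is given below; the connected-component centroids, grid stride, and confidence thresholds are illustrative choices, and the paper's in-situ melanoma detection step is omitted.

```python
import numpy as np
from scipy import ndimage

def mask_to_prompts(prob_map, fg_thresh=0.5, conf_thresh=0.8, grid_stride=64):
    """Turn a preliminary probability mask into point prompts (x, y) for SAM."""
    fg = prob_map > fg_thresh
    labels, n = ndimage.label(fg)                                    # connected components
    centroids = ndimage.center_of_mass(fg, labels, range(1, n + 1))  # one centroid prompt per component
    points = [(int(x), int(y)) for y, x in centroids]
    ys, xs = np.mgrid[0:prob_map.shape[0]:grid_stride, 0:prob_map.shape[1]:grid_stride]
    for y, x in zip(ys.ravel(), xs.ravel()):
        if prob_map[y, x] > conf_thresh:                             # low-confidence regions are filtered out
            points.append((int(x), int(y)))                          # grid prompts for broad coverage
    return points

prob = np.zeros((512, 512)); prob[100:200, 150:300] = 0.9
print(mask_to_prompts(prob)[:5])
```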
https://arxiv.org/abs/2410.02207
Remote sensing imagery plays an irreplaceable role in fields such as agriculture, water resources, military, and disaster relief. Pixel-level interpretation is a critical aspect of remote sensing image applications; however, a prevalent limitation remains the need for extensive manual annotation. To this end, we introduce open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, because remote sensing images are sensitive to low-resolution features, the predicted masks exhibit distorted target shapes and ill-fitting boundaries. To tackle this issue, we propose a simple and general upsampler, SimFeatUp, which restores the spatial information lost in deep features in a training-free manner. Further, based on the observation that local patch tokens respond abnormally to the [CLS] token in CLIP, we propose a straightforward subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning semantic segmentation, building extraction, road detection, and flood detection tasks. Our method achieves average improvements of 5.8%, 8.2%, 4%, and 15.3% over state-of-the-art methods on the 4 tasks. All code is released. \url{this https URL}
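The global-bias correction can be pictured as the small sketch below: every patch token has a scaled copy of the [CLS]-like global token subtracted before patch-text similarities are computed. The scale factor and toy embeddings are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def debias_patch_tokens(patch_tokens, cls_token, lambda_=0.3):
    """patch_tokens: (N, d) CLIP patch embeddings; cls_token: (d,) global embedding."""
    return patch_tokens - lambda_ * cls_token.unsqueeze(0)   # subtract the shared global component

# toy usage: per-patch open-vocabulary scores via cosine similarity with text embeddings
patch, cls = torch.randn(196, 512), torch.randn(512)
text = torch.randn(5, 512)                                    # embeddings of 5 class names
scores = F.normalize(debias_patch_tokens(patch, cls), dim=-1) @ F.normalize(text, dim=-1).T
print(scores.shape)                                           # torch.Size([196, 5])
```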
https://arxiv.org/abs/2410.01768
In this paper, we explore a novel Text-supervised Egocentric Semantic Segmentation (TESS) task that aims to assign pixel-level categories to egocentric images, weakly supervised by texts from image-level labels. In this task with prospective potential, the egocentric scenes contain dense wearer-object relations and inter-object interference. However, most recent third-view methods leverage the frozen Contrastive Language-Image Pre-training (CLIP) model, which is pre-trained on semantic-oriented third-view data and lapses in the egocentric view due to the "relation insensitive" problem. Hence, we propose a Cognition Transferring and Decoupling Network (CTDN) that first learns the egocentric wearer-object relations by correlating the image and text. Besides, a Cognition Transferring Module (CTM) is developed to distill cognitive knowledge from the large-scale pre-trained model into our model for recognizing egocentric objects with various semantics. Based on the transferred cognition, the Foreground-background Decoupling Module (FDM) disentangles the visual representations to explicitly discriminate the foreground and background regions, mitigating false activation areas caused by foreground-background interferential objects during egocentric relation learning. Extensive experiments on four TESS benchmarks demonstrate the effectiveness of our approach, which outperforms many recent related methods by a large margin. Code will be available at this https URL.
https://arxiv.org/abs/2410.01341
This paper introduces a new approach to extract and analyze vector data from technical drawings in PDF format. Our method involves converting PDF files into SVG format and creating a feature-rich graph representation, which captures the relationships between vector entities using geometrical information. We then apply a graph attention transformer with hierarchical label definition to achieve accurate line-level segmentation. Our approach is evaluated on two datasets, including the public FloorplanCAD dataset, on which it achieves state-of-the-art weighted F1 scores, surpassing existing methods. The proposed vector-based method offers a more scalable solution for large-scale technical drawing analysis than vision-based approaches, while also requiring significantly less GPU power than current state-of-the-art vector-based techniques. Moreover, it demonstrates improved performance in terms of the weighted F1 (wF1) score on the semantic segmentation task. Our results demonstrate the effectiveness of our approach in extracting meaningful information from technical drawings, enabling new applications and improving existing workflows in the AEC industry. Potential applications include automated building information modeling (BIM) and construction planning, which could significantly impact the efficiency and productivity of the industry.
https://arxiv.org/abs/2410.01336
Few-shot medical image segmentation (FSMIS) aims to learn from limited annotated data within the scope of medical image analysis. Despite the progress that has been achieved, current FSMIS models are trained and deployed on the same data domain, which is inconsistent with the clinical reality that medical imaging data always spans different domains (e.g., imaging modalities, institutions, and equipment sequences). How can FSMIS models be enhanced to generalize well across different medical imaging domains? In this paper, we focus on the matching mechanism of few-shot semantic segmentation models and introduce an Earth Mover's Distance (EMD)-based, domain-robust matching mechanism for the cross-domain scenario. Specifically, we formulate the EMD transportation process between the foreground support and query features, and introduce a texture-structure-aware weight generation method, which performs Sobel-based image gradient calculation over the nodes, into the EMD matching flow to restrain domain-relevant nodes. Besides, a point-set-level distance metric is introduced to calculate the cost of transporting support-set nodes to query-set nodes. To evaluate the performance of our model, we conduct experiments on three scenarios (i.e., cross-modal, cross-sequence, and cross-institution), covering eight medical datasets and three body regions, and the results demonstrate that our model achieves SoTA performance against the compared models.
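The sketch below illustrates the two ingredients under stated assumptions: Sobel gradient magnitudes over the feature map act as node weights for the transport, and a pairwise distance between support and query nodes acts as the ground cost. It uses the POT library (pip install pot) purely for the transport solve and is not the authors' implementation.

```python
import numpy as np
import ot                      # POT: Python Optimal Transport
from scipy import ndimage

def sobel_node_weights(feat):
    """feat: (C, H, W) feature map -> one weight per spatial node, normalized to sum to 1."""
    energy = feat.mean(axis=0)
    gx, gy = ndimage.sobel(energy, axis=1), ndimage.sobel(energy, axis=0)
    w = np.hypot(gx, gy).ravel() + 1e-6        # texture-structure-aware weights
    return w / w.sum()

def emd_matching_cost(support_feat, query_feat):
    a, b = sobel_node_weights(support_feat), sobel_node_weights(query_feat)
    s = support_feat.reshape(support_feat.shape[0], -1).T   # (H*W, C) support nodes
    q = query_feat.reshape(query_feat.shape[0], -1).T       # (H*W, C) query nodes
    M = ot.dist(s, q, metric="euclidean")                   # point-set-level ground cost
    plan = ot.emd(a, b, M)                                   # optimal transport plan
    return float((plan * M).sum())                           # EMD matching cost

sup, qry = np.random.rand(32, 16, 16), np.random.rand(32, 16, 16)
print(emd_matching_cost(sup, qry))
```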
https://arxiv.org/abs/2410.01110
The escalating use of Unmanned Aerial Vehicles (UAVs) as remote sensing platforms has garnered considerable attention, proving invaluable for ground object recognition. While satellite remote sensing images face limitations in resolution and weather susceptibility, UAV remote sensing, employing low-speed unmanned aircraft, offers enhanced object resolution and agility. The advent of advanced machine learning techniques has propelled significant strides in image analysis, particularly in semantic segmentation for UAV remote sensing images. This paper evaluates the effectiveness and efficiency of SegFormer, a semantic segmentation framework, for the semantic segmentation of UAV images. SegFormer variants, ranging from the real-time (B0) to the high-performance (B5) model, are assessed on the UAVid dataset, which is tailored for semantic segmentation tasks. The paper details the architecture and training procedures specific to SegFormer in the context of UAV semantic segmentation. Experimental results showcase the model's performance on the benchmark dataset, highlighting its ability to accurately delineate objects and land cover features in diverse UAV scenarios, achieving both high efficiency and strong performance.
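For readers who want a starting point, the snippet below runs a SegFormer variant through the Hugging Face transformers API (assuming a recent version where SegformerImageProcessor is available); the ADE20K-pretrained B0 checkpoint and file name are only stand-ins, whereas the paper fine-tunes B0-B5 on UAVid.

```python
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

ckpt = "nvidia/segformer-b0-finetuned-ade-512-512"           # stand-in checkpoint
processor = SegformerImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt)

image = Image.open("uav_frame.jpg").convert("RGB")           # any UAV frame
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                          # (1, num_labels, H/4, W/4)
pred = logits.argmax(dim=1)                                  # per-pixel class indices at 1/4 resolution
print(pred.shape)
```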
https://arxiv.org/abs/2410.01092
Subsampling layers play a crucial role in deep nets by discarding a portion of an activation map to reduce its spatial dimensions. This encourages the deep net to learn higher-level representations. Contrary to this motivation, we hypothesize that the discarded activations are useful and can be incorporated on the fly to improve a model's predictions. To validate our hypothesis, we propose a search-and-aggregate method to find useful activation maps to be used at test time. We applied our approach to the tasks of image classification and semantic segmentation. Extensive experiments over nine different architectures on multiple datasets show that our method consistently improves model test-time performance, complementing existing test-time augmentation techniques. Our code is available at this https URL.
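One way to picture "using the discarded activations" is the toy sketch below, which enumerates the four possible offsets of a stride-2 max pooling at test time and averages the resulting predictions; the paper's actual search-and-aggregate procedure is more involved, so treat this only as an illustration of the idea.

```python
import torch
import torch.nn.functional as F

def offset_pool_predict(head, feat):
    """feat: (B, C, H, W) pre-pooling activations; head maps pooled features to logits."""
    logits = []
    for dy in (0, 1):
        for dx in (0, 1):
            shifted = feat[:, :, dy:, dx:]                    # a different 2x2 sub-grid each time
            pooled = F.max_pool2d(shifted, kernel_size=2, stride=2)
            logits.append(head(pooled))                       # re-use the unchanged downstream head
    return torch.stack(logits).mean(dim=0)                    # aggregate the extra predictions at test time

head = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(8, 10))
print(offset_pool_predict(head, torch.randn(2, 8, 16, 16)).shape)  # torch.Size([2, 10])
```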
https://arxiv.org/abs/2410.01083
Efficient point cloud (PC) compression is crucial for streaming applications such as augmented reality and cooperative perception. Classic PC compression techniques encode all the points in a frame. Tailoring compression towards perception tasks at the receiver side, we ask the question: "Can we remove the ground points during transmission without sacrificing detection performance?" Our study reveals a strong dependency of state-of-the-art (SOTA) 3D object detection models on the ground, especially on points below and around the objects. In this work, we propose a lightweight obstacle-aware Pillar-based Ground Removal (PGR) algorithm. PGR filters out ground points that do not provide context for object recognition, significantly improving the compression ratio without sacrificing receiver-side perception performance. Because it does not use heavy object detection or semantic segmentation models, PGR is lightweight, highly parallelizable, and effective. Our evaluations on KITTI and the Waymo Open Dataset show that SOTA detection models work equally well with PGR removing 20-30% of the points, while running at 86 FPS.
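A rough sketch of a pillar-based ground filter is shown below, assuming fixed-size x-y pillars and treating anything within a small tolerance of each pillar's lowest point as ground; the obstacle-aware part of PGR (keeping ground context around objects) is deliberately omitted, so this is only a simplified stand-in.

```python
import numpy as np

def pillar_ground_removal(points, pillar_size=0.5, z_tol=0.2):
    """points: (N, 3) array of x, y, z; returns the non-ground subset."""
    ij = np.floor(points[:, :2] / pillar_size).astype(np.int64)   # pillar index of each point
    keys = ij[:, 0] * 1_000_000 + ij[:, 1]                        # flatten the 2D pillar index
    keep = np.ones(len(points), dtype=bool)
    for key in np.unique(keys):
        idx = np.where(keys == key)[0]
        z_min = points[idx, 2].min()                              # per-pillar ground estimate
        keep[idx[points[idx, 2] < z_min + z_tol]] = False         # drop near-ground points
    return points[keep]

pts = np.random.rand(10000, 3) * [50.0, 50.0, 3.0]
print(pillar_ground_removal(pts).shape)
```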
https://arxiv.org/abs/2410.00582
Capturing real-world 3D spaces as point clouds is efficient and descriptive, but it comes with sensor errors and lacks object parametrization. These limitations render point clouds unsuitable for various real-world applications, such as robot programming, without extensive post-processing (e.g., outlier removal, semantic segmentation). On the other hand, CAD modeling provides high-quality, parametric representations of 3D space with embedded semantic data, but requires manual component creation that is time-consuming and costly. To address these challenges, we propose a novel solution that combines the strengths of both approaches. Our method for 3D workcell sketching from point clouds allows users to refine raw point clouds using an Augmented Reality (AR) interface that leverages their knowledge and the real-world 3D environment. By utilizing a toolbox and an AR-enabled pointing device, users can enhance point cloud accuracy based on the device's position in 3D space. We validate our approach by comparing it with ground truth models, demonstrating that it achieves a mean error within 1 cm, a significant improvement over standard LiDAR scanner apps.
https://arxiv.org/abs/2410.00479
Accurate semantic segmentation of remote sensing imagery is critical for various Earth observation applications, such as land cover mapping, urban planning, and environmental monitoring. However, individual data sources often present limitations for this task. Very High Resolution (VHR) aerial imagery provides rich spatial details but cannot capture temporal information about land cover changes. Conversely, Satellite Image Time Series (SITS) capture temporal dynamics, such as seasonal variations in vegetation, but with limited spatial resolution, making it difficult to distinguish fine-scale objects. This paper proposes a late fusion deep learning model (LF-DLM) for semantic segmentation that leverages the complementary strengths of both VHR aerial imagery and SITS. The proposed model consists of two independent deep learning branches. One branch integrates detailed textures from aerial imagery captured by UNetFormer with a Multi-Axis Vision Transformer (MaxViT) backbone. The other branch captures complex spatio-temporal dynamics from the Sentinel-2 satellite image time series using a U-Net with Temporal Attention Encoder (U-TAE). This approach leads to state-of-the-art results on the FLAIR dataset, a large-scale benchmark for land cover segmentation using multi-source optical imagery. The findings highlight the importance of multi-modality fusion in improving the accuracy and robustness of semantic segmentation in remote sensing applications.
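Decision-level (late) fusion itself reduces to a weighted combination of the two branches' logits, as in the bare-bones sketch below; the branch internals (UNetFormer/MaxViT and U-TAE) are abstracted away, and the fusion weight is an illustrative assumption.

```python
import torch

def late_fusion(aerial_logits, sits_logits, w_aerial=0.5):
    """Both inputs: (B, n_classes, H, W) logits aligned to the same grid."""
    return w_aerial * aerial_logits + (1.0 - w_aerial) * sits_logits

aerial = torch.randn(1, 13, 256, 256)   # VHR aerial branch output
sits = torch.randn(1, 13, 256, 256)     # Sentinel-2 time-series branch output, resampled to match
print(late_fusion(aerial, sits).argmax(dim=1).shape)  # torch.Size([1, 256, 256])
```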
https://arxiv.org/abs/2410.00469
Autonomous racing demands safe control of vehicles at their physical limits for extended periods of time, providing insights into advanced vehicle safety systems that increasingly rely on intervention provided by vehicle autonomy. Participation in this field carries a high barrier to entry. Physical platforms and their associated sensor suites require large capital outlays before any demonstrable progress can be made. Simulators allow researchers to develop soft autonomous systems without purchasing a platform. However, currently available simulators lack visual and dynamic fidelity, can still be expensive to buy, lack customisation, and are difficult to use. AARK provides three packages, ACI, ACDG, and ACMPC. These packages enable research into autonomous control systems in the demanding environment of racing, bringing more people into the field and improving reproducibility: ACI provides researchers with a computer-vision-friendly interface to Assetto Corsa for convenient comparison and evaluation of autonomous control solutions; ACDG enables generation of depth, normal, and semantic segmentation data for training computer vision models used in perception systems; and ACMPC gives newcomers to the field a modular, full-stack autonomous control solution, capable of controlling vehicles, to build from. AARK aims to unify and democratise research into a field critical to providing safer roads and trusted autonomous systems.
https://arxiv.org/abs/2410.00358
Scene sketch semantic segmentation is a crucial task for various applications including sketch-to-image retrieval and scene understanding. Existing sketch segmentation methods treat sketches as bitmap images, leading to the loss of temporal order among strokes due to the shift from vector to image format. Moreover, these methods struggle to segment objects from categories absent in the training data. In this paper, we propose a Class-Agnostic Visio-Temporal Network (CAVT) for scene sketch semantic segmentation. CAVT employs a class-agnostic object detector to detect individual objects in a scene and groups the strokes of instances through its post-processing module. This is the first approach that performs segmentation at both the instance and stroke levels within scene sketches. Furthermore, there is a lack of free-hand scene sketch datasets with both instance and stroke-level class annotations. To fill this gap, we collected the largest Free-hand Instance- and Stroke-level Scene Sketch Dataset (FrISS) that contains 1K scene sketches and covers 403 object classes with dense annotations. Extensive experiments on FrISS and other datasets demonstrate the superior performance of our method over state-of-the-art scene sketch segmentation models. The code and dataset will be made public after acceptance.
https://arxiv.org/abs/2410.00266
The Area Under the ROC Curve (AUC) is a well-known metric for evaluating instance-level long-tail learning problems. In the past two decades, many AUC optimization methods have been proposed to improve model performance under long-tail distributions. In this paper, we explore AUC optimization methods in the context of pixel-level long-tail semantic segmentation, a much more complicated scenario. This task introduces two major challenges for AUC optimization techniques. On one hand, AUC optimization in a pixel-level task involves complex coupling across loss terms, with structured inner-image and pairwise inter-image dependencies, complicating theoretical analysis. On the other hand, we find that mini-batch estimation of AUC loss in this case requires a larger batch size, resulting in an unaffordable space complexity. To address these issues, we develop a pixel-level AUC loss function and conduct a dependency-graph-based theoretical analysis of the algorithm's generalization ability. Additionally, we design a Tail-Classes Memory Bank (T-Memory Bank) to manage the significant memory demand. Finally, comprehensive experiments across various benchmarks confirm the effectiveness of our proposed AUCSeg method. The code is available at this https URL.
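To give a sense of what an AUC loss looks like at the pixel level, the sketch below uses a pairwise squared-hinge surrogate between positive-class and negative-class pixel scores with pair subsampling; the paper's actual loss, its coupling analysis, and the T-Memory Bank are substantially more elaborate, so this is only an assumed simplification.

```python
import torch

def pixel_auc_surrogate(scores, labels, cls, margin=1.0, max_pairs=4096):
    """scores: (B, C, H, W) logits; labels: (B, H, W) ints; cls: (tail) class of interest."""
    s = scores[:, cls].reshape(-1)
    y = (labels == cls).reshape(-1)
    pos, neg = s[y], s[~y]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.sum() * 0.0                            # no pixel pairs to rank in this batch
    pos = pos[torch.randint(pos.numel(), (max_pairs,))]      # subsample pairs for tractability
    neg = neg[torch.randint(neg.numel(), (max_pairs,))]
    return torch.clamp(margin - (pos - neg), min=0).pow(2).mean()

scores = torch.randn(2, 5, 64, 64, requires_grad=True)
labels = torch.randint(0, 5, (2, 64, 64))
print(pixel_auc_surrogate(scores, labels, cls=3))
```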
https://arxiv.org/abs/2409.20398
Convolutional neural networks (CNNs) achieve prevailing results in segmentation tasks nowadays and represent the state-of-the-art for image-based analysis. However, the precise decision-making process of a CNN remains largely opaque. The research area of explainable artificial intelligence (xAI) primarily revolves around understanding and interpreting this black-box behavior. One way of interpreting a CNN is the use of class activation maps (CAMs), heatmaps that indicate the importance of image areas for the CNN's prediction. For classification tasks, a variety of CAM algorithms exist. But for segmentation tasks, only one CAM algorithm for interpreting the output of a CNN exists. We propose a transfer between existing classification- and segmentation-based methods for more detailed, explainable, and consistent results that show salient pixels in semantic segmentation tasks. The resulting Seg-HiRes-Grad CAM extends the segmentation-based Seg-Grad CAM by transferring the classification-based HiRes CAM to segmentation. Our method improves the previously mentioned segmentation-based method by adjusting it to recently published classification-based methods. Especially for medical image segmentation, this transfer resolves existing explainability disadvantages.
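A minimal sketch of a HiRes-CAM-style computation applied to a segmentation output is shown below: gradients of a class score summed over a region of interest are multiplied element-wise with the target layer's activations before summing over channels, instead of Grad-CAM's per-channel gradient averaging. The tiny model and layer choice are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Placeholder segmentation net that exposes its last feature map."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, 3, padding=1)
        self.head = nn.Conv2d(8, n_classes, 1)
    def forward(self, x):
        self.feat = self.backbone(x)      # keep the target layer's activations
        self.feat.retain_grad()
        return self.head(self.feat)

def seg_hires_grad_cam(model, image, target_class, roi_mask):
    logits = model(image)
    score = logits[0, target_class][roi_mask].sum()       # class score over the region of interest
    score.backward()
    cam = (model.feat.grad * model.feat).sum(dim=1)       # element-wise product, no channel averaging
    return torch.relu(cam)[0]

net, img = TinySegNet(), torch.randn(1, 3, 32, 32)
roi = torch.zeros(32, 32, dtype=torch.bool); roi[8:24, 8:24] = True
print(seg_hires_grad_cam(net, img, target_class=1, roi_mask=roi).shape)  # torch.Size([32, 32])
```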
https://arxiv.org/abs/2409.20287
Data augmentation is one of the most common tools in deep learning, underpinning many recent advances in tasks such as classification, detection, and semantic segmentation. The standard approach to data augmentation involves simple transformations like rotation and flipping to generate new images. However, these new images often lack diversity along the main semantic dimensions within the data. Traditional data augmentation methods cannot alter high-level semantic attributes, such as the presence of vehicles, trees, and buildings in a scene, to enhance data diversity. In recent years, the rapid development of generative models has injected new vitality into the field of data augmentation. In this paper, we address the lack of diversity in data augmentation for the road detection task by using a pre-trained text-to-image diffusion model to parameterize image-to-image transformations. Our method involves editing images using these diffusion models to change their semantics. In essence, we achieve this goal by erasing instances of real objects from the original dataset and generating new instances with similar semantics in the erased regions using the diffusion model, thereby expanding the original dataset. We evaluate our approach on the KITTI road dataset and achieve the best results compared to other data augmentation methods, demonstrating the effectiveness of our proposed method.
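A hedged sketch of the erase-and-regenerate step using an off-the-shelf diffusion inpainting pipeline from the diffusers library is shown below; the checkpoint name, prompt, file names, and GPU assumption are illustrative and not necessarily the paper's setup.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("kitti_frame.png").convert("RGB").resize((512, 512))
mask = Image.open("erased_vehicle_mask.png").convert("L").resize((512, 512))  # white = region to regenerate

# regenerate a semantically similar object inside the erased region
augmented = pipe(prompt="a car on a road, photorealistic street scene",
                 image=image, mask_image=mask).images[0]
augmented.save("kitti_frame_aug.png")
```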
https://arxiv.org/abs/2409.20164
In the woodworking industry, a huge amount of effort has to be invested into the initial quality assessment of the raw material. In this study we present an AI model to detect, quantify and localize defects on wooden logs. This model aims to both automate the quality control process and provide a more consistent and reliable quality assessment. For this purpose a dataset of 1424 sample images of wood logs is created. A total of 5 annotators with different levels of expertise are involved in dataset creation. An inter-annotator agreement analysis is conducted to analyze the impact of expertise on the annotation task and to highlight subjective differences in annotator judgement. We explore, train and fine-tune the state-of-the-art InternImage and ONE-PEACE architectures for semantic segmentation. The best model achieves an average IoU of 0.71 and shows detection and quantification capabilities close to those of the human annotators.
https://arxiv.org/abs/2409.20137