Scene_Parsing

Fully Exploiting Vision Foundation Model's Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing

2025-02-10 07:50:22

Sicen Guo, Tianyou Wen, Chuang-Wei Liu, Qijun Chen, Rui Fan

arXiv_CV

arXiv_CV Knowledge Prediction Transformer Pose Action Scene_Parsing
Abstract

Recent vision foundation models (VFMs), typically based on Vision Transformer (ViT), have significantly advanced numerous computer vision tasks. Despite their success in tasks focused solely on RGB images, the potential of VFMs in RGB-depth driving scene parsing remains largely under-explored. In this article, we take one step toward this emerging research area by investigating a feasible technique to fully exploit VFMs for generalizable RGB-depth driving scene parsing. Specifically, we explore the inherent characteristics of RGB and depth data, thereby presenting a Heterogeneous Feature Integration Transformer (HFIT). This network enables the efficient extraction and integration of comprehensive heterogeneous features without re-training ViTs. Relative depth prediction results from VFMs, used as inputs to the HFIT side adapter, overcome the limitations of the dependence on depth maps. Our proposed HFIT demonstrates superior performance compared to all other traditional single-modal and data-fusion scene parsing networks, pre-trained VFMs, and ViT adapters on the Cityscapes and KITTI Semantics datasets. We believe this novel strategy paves the way for future innovations in VFM-based data-fusion techniques for driving scene parsing. Our source code is publicly available at this https URL.

Abstract (translated)

最近的视觉基础模型（VFMs），通常基于Vision Transformer (ViT)，已经在众多计算机视觉任务中取得了显著的进步。尽管它们在仅关注RGB图像的任务上表现出色，但在RGB-深度驾驶场景解析中的潜力却尚未被充分探索。本文朝着这一新兴研究领域迈出了一步，通过调查一种可行的技术来充分利用VFMs进行通用的RGB-深度驾驶场景解析。具体而言，我们探讨了RGB和深度数据的基本特性，并提出了异构特征集成变换器（HFIT）。该网络能够高效地提取并整合全面的异构特征，而无需重新训练ViT模型。来自VFMs的相关深度预测结果被用作HFIT侧适配器的输入，从而克服了对深度图依赖的限制。我们提出的HFIT在Cityscapes和KITTI Semantics数据集上与所有传统的单模态和数据融合场景解析网络、预训练的VFMs以及ViT适配器相比，表现出了更优越的性能。我们认为这一新颖策略为基于VFM的数据融合技术在未来驾驶场景解析中的创新铺平了道路。我们的源代码公开可获取于[提供的URL]。

URL

https://arxiv.org/abs/2502.06219

PDF

https://arxiv.org/pdf/2502.06219.pdf
Read All
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing

2024-11-05 07:02:25

Pranav Gupta, Rishubh Singh, Pradeep Shenoy, Ravikiran Sarvadevabhatla

arXiv_CV

arXiv_CV Segmentation Optimization Transformer Pose Scene_Parsing
Abstract

Multi-object multi-part scene segmentation is a challenging task whose complexity scales exponentially with part granularity and number of scene objects. To address the task, we propose a plug-and-play approach termed OLAF. First, we augment the input (RGB) with channels containing object-based structural cues (fg/bg mask, boundary edge mask). We propose a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization. In addition, we introduce an encoder module termed LDF to provide low-level dense feature guidance. This assists segmentation, particularly for smaller parts. OLAF enables significant mIoU gains of $\mathbf{3.3}$ (Pascal-Parts-58), $\mathbf{3.5}$ (Pascal-Parts-108) over the SOTA model. On the most challenging variant (Pascal-Parts-201), the gain is $\mathbf{4.0}$. Experimentally, we show that OLAF's broad applicability enables gains across multiple architectures (CNN, U-Net, Transformer) and datasets. The code is available at this http URL

Abstract (translated)

多对象多部件场景分割是一项具有挑战性的任务，其复杂性随零件的细致程度和场景中对象数量呈指数级增长。为了解决这一问题，我们提出了一种可插拔的方法，称为OLAF。首先，我们在输入（RGB）中加入包含基于物体结构线索的通道（前景/背景掩模、边界边缘掩模）。我们提出了一种权重适应技术，使常规预训练模型在优化过程中能够稳定处理增强后的5通道输入。此外，我们引入了一个编码模块LDF，以提供低级密集特征指导，这有助于分割，特别是对较小的部分来说更为有效。OLAF使得mIoU得分显著提升，在Pascal-Parts-58数据集上提升了$\mathbf{3.3}$，在Pascal-Parts-108数据集上提升了$\mathbf{3.5}$，超过现有最佳模型的表现。在最具挑战性的变体（Pascal-Parts-201）上，提升更是达到了$\mathbf{4.0}$。实验表明，OLAF的广泛适用性使其能够在多种架构（CNN、U-Net、Transformer）和数据集上取得提升。代码可以在以下网址获取：[此HTTP URL]

URL

https://arxiv.org/abs/2411.02858

PDF

https://arxiv.org/pdf/2411.02858.pdf
Read All
RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion

2024-07-31 14:25:16

Jianxin Huang, Jiahang Li, Ning Jia, Yuxiang Sun, Chengju Liu, Qijun Chen, Rui Fan

arXiv_CV

arXiv_CV CNN Face Attention Prediction Transformer Pose Scene_Parsing
Abstract

Task-specific data-fusion networks have marked considerable achievements in urban scene parsing. Among these networks, our recently proposed RoadFormer successfully extracts heterogeneous features from RGB images and surface normal maps and fuses these features through attention mechanisms, demonstrating compelling efficacy in RGB-Normal road scene parsing. However, its performance significantly deteriorates when handling other types/sources of data or performing more universal, all-category scene parsing tasks. To overcome these limitations, this study introduces RoadFormer+, an efficient, robust, and adaptable model capable of effectively fusing RGB-X data, where ``X'', represents additional types/modalities of data such as depth, thermal, surface normal, and polarization. Specifically, we propose a novel hybrid feature decoupling encoder to extract heterogeneous features and decouple them into global and local components. These decoupled features are then fused through a dual-branch multi-scale heterogeneous feature fusion block, which employs parallel Transformer attentions and convolutional neural network modules to merge multi-scale features across different scales and receptive fields. The fused features are subsequently fed into a decoder to generate the final semantic predictions. Notably, our proposed RoadFormer+ ranks first on the KITTI Road benchmark and achieves state-of-the-art performance in mean intersection over union on the Cityscapes, MFNet, FMB, and ZJU datasets. Moreover, it reduces the number of learnable parameters by 65\% compared to RoadFormer. Our source code will be publicly available at mias.group/RoadFormerPlus.

Abstract (translated)

任务特定数据融合网络在城市场景解析方面取得了显著的成就。在这些网络中，我们最近提出的RoadFormer成功地从红绿蓝图像和表面法线图中提取异质特征，并通过注意力机制将这些特征融合在一起，证明了在RGB-Normal道路场景解析方面具有引人注目的效果。然而，当处理其他类型/来源的数据或执行更通用的全类别场景解析任务时，其性能显著下降。为了克服这些限制，本研究引入了RoadFormer+，一种高效、稳健、可适应的模型，能够有效地将RGB-X数据进行融合，其中“X”代表深度、热、表面法线和极化等额外类型/模块。具体来说，我们提出了一个新颖的混合特征解耦编码器，以提取异质特征并将它们解耦为全局和局部组件。这些解耦的特征通过一个双分支多尺度异质特征融合块进行融合，该块采用并行Transformer注意力和卷积神经网络模块将不同尺度和感受野上的多尺度特征合并。融合后的特征随后输入解码器以生成最终语义预测。值得注意的是，我们提出的RoadFormer+在KITTI Road基准上排名第一，并在Cityscapes、MFNet、FMB和ZJU数据集上实现了与最先进方法相同的平均交集 over 联合。此外，与RoadFormer相比，它减少了65%的可学习参数。我们的源代码将公开发布在mias.group/RoadFormerPlus上。

URL

https://arxiv.org/abs/2407.21631

PDF

https://arxiv.org/pdf/2407.21631.pdf
Read All
Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning

2024-07-02 07:35:21

Chengchao Shen, Jianzhong Chen, Jianxin Wang

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Detection Object_Detection Represenation_Learning Unsupervised Pose Scene_Parsing Contrastive_Learning
Abstract

The existing contrastive learning methods mainly focus on single-grained representation learning, e.g., part-level, object-level or scene-level ones, thus inevitably neglecting the transferability of representations on other granularity levels. In this paper, we aim to learn multi-grained representations, which can effectively describe the image on various granularity levels, thus improving generalization on extensive downstream tasks. To this end, we propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning. Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast by the correspondences to learn more general unsupervised representations. Without pretrained on large-scale dataset, our method significantly outperforms the existing state-of-the-art methods on extensive downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation and keypoint detection. Moreover, experimental results support the data-efficient property and excellent representation transferability of our method. The source code and trained weights are available at \url{this https URL}.

Abstract (translated)

目前，对比学习方法主要集中在单粒度表示学习，例如部分级别、对象级别或场景级别，从而忽略了其他粒度级别上表示的转移性。在本文中，我们的目标是学习多粒度表示，可以有效地描述图像在各个粒度级别上的特征，从而提高在广泛的下游任务上的泛化能力。为此，我们提出了一个名为多粒度对比（MGC）的无监督表示学习的新方法。具体来说，我们通过构建积极视角和对应关系之间的细腻多粒度对应关系，然后通过对应关系进行多粒度对比，学习更通用的无监督表示。在没有在大规模数据集上预训练的情况下，我们的方法在广泛的下游任务上显著超过了现有最先进的 methods，包括目标检测、实例分割、场景解析、语义分割和关键点检测。此外，实验结果证实了我们的方法具有数据有效的特性和出色的表示转移能力。源代码和训练后的权重可在此处访问：\url{这个链接}。

URL

https://arxiv.org/abs/2407.02014

PDF

https://arxiv.org/pdf/2407.02014.pdf
Read All
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation

2024-06-08 04:43:08

Qingfeng Liu, Mostafa El-Khamy, Kee-Bong Song

arXiv_CV

arXiv_CV Video_Caption Segmentation Semantic_Segmentation Tracking Transformer Scene_Parsing
Abstract

The third Pixel-level Video Understanding in the Wild (PVUW CVPR 2024) challenge aims to advance the state of art in video understanding through benchmarking Video Panoptic Segmentation (VPS) and Video Semantic Segmentation (VSS) on challenging videos and scenes introduced in the large-scale Video Panoptic Segmentation in the Wild (VIPSeg) test set and the large-scale Video Scene Parsing in the Wild (VSPW) test set, respectively. This paper details our research work that achieved the 1st place winner in the PVUW'24 VPS challenge, establishing state of art results in all metrics, including the Video Panoptic Quality (VPQ) and Segmentation and Tracking Quality (STQ). With minor fine-tuning our approach also achieved the 3rd place in the PVUW'24 VSS challenge ranked by the mIoU (mean intersection over union) metric and the first place ranked by the VC16 (16-frame video consistency) metric. Our winning solution stands on the shoulders of giant foundational vision transformer model (DINOv2 ViT-g) and proven multi-stage Decoupled Video Instance Segmentation (DVIS) frameworks for video understanding.

Abstract (translated)

第三个在野外像素级别的视频理解挑战（PVUW CVPR 2024）旨在通过在大型野外视频全景分割（VIPSeg）测试集和大型野外视频场景解析（VSPW）测试集中对具有挑战性的视频和场景进行基准测试，分别推动视频理解技术的进步。本文详细介绍了我们在PVUW'24 VPS挑战中取得第一名的 research 工作，建立了包括 Video Panoptic Quality（VPQ）和 Segmentation and Tracking Quality（STQ）在内的所有指标的最佳结果。通过微调我们的方法，我们的解决方案还获得了 PVUW'24 VSS 挑战中的第三名，根据mIoU（mean intersection over union）指标排名第三，根据 VC16（16-帧视频一致性）指标排名第一。我们的获胜解决方案站在了大型基础视觉变换模型（DINOv2 ViT-g）和经过验证的多阶段解耦视频实例分割（DVIS）框架的肩膀上，为视频理解技术的发展做出了巨大贡献。

URL

https://arxiv.org/abs/2406.05352

PDF

https://arxiv.org/pdf/2406.05352.pdf
Read All
Radar Spectra-Language Model for Automotive Scene Parsing

2024-06-04 09:45:04

Mariia Pushkareva, Yuri Feldman, Csaba Domokos, Kilian Rambach, Dotan Di Castro

arXiv_CV

arXiv_CV Segmentation Detection Object_Detection Embedding Language_Model Transformer Autonomous Scene_Parsing Point_Cloud Matching
Abstract

Radar sensors are low cost, long-range, and weather-resilient. Therefore, they are widely used for driver assistance functions, and are expected to be crucial for the success of autonomous driving in the future. In many perception tasks only pre-processed radar point clouds are considered. In contrast, radar spectra are a raw form of radar measurements and contain more information than radar point clouds. However, radar spectra are rather difficult to interpret. In this work, we aim to explore the semantic information contained in spectra in the context of automated driving, thereby moving towards better interpretability of radar spectra. To this end, we create a radar spectra-language model, allowing us to query radar spectra measurements for the presence of scene elements using free text. We overcome the scarcity of radar spectra data by matching the embedding space of an existing vision-language model (VLM). Finally, we explore the benefit of the learned representation for scene parsing, and obtain improvements in free space segmentation and object detection merely by injecting the spectra embedding into a baseline model.

Abstract (translated)

雷达传感器具有低成本、远距离和天气耐用性。因此，它们广泛应用于驾驶辅助功能，并预计将在未来自动驾驶的成功中扮演关键角色。在许多感知任务中，只考虑预处理后的雷达点云。相比之下，雷达频谱是一种原始的雷达测量形式，含有比雷达点云更多的信息。然而，雷达频谱的解读相当困难。在这项工作中，我们旨在探索频谱中包含的语义信息，从而在自动驾驶中实现更好的可解释性。为此，我们创建了一个雷达频谱-语言模型，使我们能够使用自由文本查询雷达频谱测量中是否存在场景元素。我们通过将现有的视觉语言模型（VLM）的嵌入空间与雷达频谱数据进行匹配来克服雷达频谱数据的稀缺性。最后，我们探讨了学习表示对场景解析的好处，只需将频谱嵌入注入到基线模型中，就能获得仅通过在自由空间分割和目标检测方面实现提高。

URL

https://arxiv.org/abs/2406.02158

PDF

https://arxiv.org/pdf/2406.02158.pdf
Read All
Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

2024-06-02 01:37:26

Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Prediction Scene_Parsing
Abstract

Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction,because the real-world is actually video-based rather than a static state. In this paper, we adopt semi-supervised video semantic segmentation method based on unreliable pseudo labels. Then, We ensemble the teacher network model with the student network model to generate pseudo labels and retrain the student network. Our method achieves the mIoU scores of 63.71% and 67.83% on development test and final test respectively. Finally, we obtain the 1st place in the Video Scene Parsing in the Wild Challenge at CVPR 2024.

Abstract (translated)

像素级别的场景理解是计算机视觉中的一个基本问题，旨在识别给定图像中每个像素的对象类、掩码和语义。与图像场景解析相比，视频场景解析引入了时间信息，可以有效提高预测的一致性和准确性，因为现实世界实际上是基于视频的，而不是静态的状态。在本文中，我们采用了基于不可靠伪标签的半监督视频语义分割方法。然后，我们将教师网络模型与学生网络模型集成，生成伪标签并重新训练学生网络。我们的方法在开发测试和最终测试中的mIoU得分分别为63.71%和67.83%。最后，我们在CVPR 2024上的视频场景解析挑战中获得了第一名的成绩。

URL

https://arxiv.org/abs/2406.00587

PDF

https://arxiv.org/pdf/2406.00587.pdf
Read All
Few-Shot Fruit Segmentation via Transfer Learning

2024-05-04 04:05:59

Jordan A. James, Heather K. Manching, Amanda M. Hulse-Kemp, William J. Beksi

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Detection Transfer_Learning Knowledge Transformer Pose Few-Shot Scene_Parsing Robot
Abstract

Advancements in machine learning, computer vision, and robotics have paved the way for transformative solutions in various domains, particularly in agriculture. For example, accurate identification and segmentation of fruits from field images plays a crucial role in automating jobs such as harvesting, disease detection, and yield estimation. However, achieving robust and precise infield fruit segmentation remains a challenging task since large amounts of labeled data are required to handle variations in fruit size, shape, color, and occlusion. In this paper, we develop a few-shot semantic segmentation framework for infield fruits using transfer learning. Concretely, our work is aimed at addressing agricultural domains that lack publicly available labeled data. Motivated by similar success in urban scene parsing, we propose specialized pre-training using a public benchmark dataset for fruit transfer learning. By leveraging pre-trained neural networks, accurate semantic segmentation of fruit in the field is achieved with only a few labeled images. Furthermore, we show that models with pre-training learn to distinguish between fruit still on the trees and fruit that have fallen on the ground, and they can effectively transfer the knowledge to the target fruit dataset.

Abstract (translated)

机器学习、计算机视觉和机器人技术的发展为各个领域带来了 transformative 解决方案，尤其是在农业领域。例如，准确从田间图像中识别和分割水果在自动化诸如采摘、疾病检测和产量估计等任务中扮演着关键角色。然而，实现稳健且精确的田间水果分割仍然具有挑战性，因为需要大量标记数据来处理水果的大小、形状、颜色和遮挡的变异。在本文中，我们为田间水果使用迁移学习开发了一个几 shot semantic segmentation 框架。具体来说，我们的工作旨在解决缺乏公开可用标记数据的农业领域。受到城市场景解析的成功启发，我们提出了使用公共基准数据集进行水果转移学习的专用预训练方案。通过利用预训练的神经网络，可以在仅几张标记图片的情况下实现水果在田间的准确语义分割。此外，我们还证明了经过预训练的模型能够区分仍然在树上的水果和已经掉在地上的水果，并且可以有效地将知识传递到目标水果数据集中。

URL

https://arxiv.org/abs/2405.02556

PDF

https://arxiv.org/pdf/2405.02556.pdf
Read All
Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks

2024-04-29 22:03:02

Christopher J. Kymn, Sonia Mazelet, Annabel Ng, Denis Kleyko, Bruno A. Olshausen

arXiv_CV

arXiv_CV CNN Recognition Tracking Sparse Pose Scene_Parsing
Abstract

We propose a system for visual scene analysis and recognition based on encoding the sparse, latent feature-representation of an image into a high-dimensional vector that is subsequently factorized to parse scene content. The sparse feature representation is learned from image statistics via convolutional sparse coding, while scene parsing is performed by a resonator network. The integration of sparse coding with the resonator network increases the capacity of distributed representations and reduces collisions in the combinatorial search space during factorization. We find that for this problem the resonator network is capable of fast and accurate vector factorization, and we develop a confidence-based metric that assists in tracking the convergence of the resonator network.

Abstract (translated)

我们提出了一个基于编码图像稀疏、潜在特征表示的视觉场景分析和识别系统。该系统将图像稀疏表示编码为高维向量，然后通过分解为解析场景内容。稀疏特征表示通过卷积稀疏编码从图像统计信息中学习，而场景解析由共振器网络完成。将稀疏编码与共振器网络相结合可以增加分布式表示的容量，并在分解过程中减少组合搜索空间中的碰撞。我们发现，对于这个问题，共振器网络能够实现快速和准确的向量分解，并且我们开发了一个基于信心的度量来协助跟踪共振器网络的收敛。

URL

https://arxiv.org/abs/2404.19126

PDF

https://arxiv.org/pdf/2404.19126.pdf
Read All
HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion

2024-04-04 15:31:11

Jiahang Li, Peng Yun, Qijun Chen, Rui Fan

arXiv_CV

arXiv_CV CNN Attention Pose Action Scene_Parsing
Abstract

Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, the majority of existing studies have relied on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between RGB and thermal modalities. Recent progress in vision foundation models (VFMs) trained through self-supervision on vast amounts of unlabeled data has proven their ability to extract informative, general-purpose features. However, this potential has yet to be fully leveraged in the domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we delve deeper into the unique characteristics of RGB and thermal modalities, thereby designing a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network. This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner. Moreover, we introduce an auxiliary task to further enrich the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, equipped with all these components, demonstrates superior performance compared to all other state-of-the-art RGB-thermal scene parsing networks, achieving top ranks across three widely used public RGB-thermal scene parsing datasets. We believe this new paradigm has opened up new opportunities for future developments in data-fusion scene parsing approaches.

Abstract (translated)

数据融合网络在色温场景解析方面表现出巨大的潜力。然而，现有的研究大多依赖于对称的多层解码器来进行异构特征提取和融合，而忽略了红光和热模态固有的差异。在通过自监督学习大量无标签数据上训练的视觉基础模型（VFMs）的最近进步证明，它们具有提取有信息量的通用特征的能力。然而，在领域内这一潜力尚未得到充分利用。在这项研究中，我们迈出这一新研究领域的一步，通过探索一种可行的策略，充分利用VFM特征进行红光-热场景解析。具体来说，我们深入研究了红光和热模态的独特特点，从而设计了一个半监督的 asymmetric 编码器，该编码器既包含一个VFM，也包含一个卷积神经网络。这种设计允许更有效地提取互补的异质特征，然后以双路、逐步的方式进行融合。此外，我们还引入了一个辅助任务，进一步丰富了融合特征的局部语义，从而提高了整个RGB-热场景解析的性能。我们提出的HAPNet，配备了所有这些组件，在所有其他最先进的RGB-热场景解析网络中表现出卓越的性能，在三处广泛使用的公共RGB-热场景解析数据集上实现了Top Rank。我们相信，这一新范式为数据融合场景解析方法的未来发展打开了新的机会。

URL

https://arxiv.org/abs/2404.03527

PDF

https://arxiv.org/pdf/2404.03527.pdf
Read All
Feature boosting with efficient attention for scene parsing

2024-02-29 15:22:21

Vivek Singh, Shailza Sharma, Fabio Cuzzolin

arXiv_CV

arXiv_CV Attention Relation Pose Action Scene_Parsing
Abstract

The complexity of scene parsing grows with the number of object and scene classes, which is higher in unrestricted open scenes. The biggest challenge is to model the spatial relation between scene elements while succeeding in identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes the attention weights for each level of representation to generate the final class labels. A novel `channel attention module' is designed to compute the attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution to preserve the abstract spatial relationships among scene elements and reduce computation cost. Spatial attention is subsequently concatenated into a final feature set before applying feature boosting. Low-resolution spatial attention features are trained using an auxiliary task that helps learning a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.

Abstract (translated)

场景解析的复杂度随着物体和场景类别的数量增加而增加，在无限制的开放场景中更高。最大的挑战是在小尺度上成功识别物体，同时建模场景元素之间的空间关系。本文提出了一种新颖的特征增强网络，该网络从多个级联的特征提取中收集空间上下文，并为每个表示级别计算注意力权重以生成最终分类标签。一种新颖的“通道注意力模块”被设计用于计算注意力权重，确保在提取阶段相关的特征得到增强，而其他特征则得到削弱。模型还在低分辨率下学习空间上下文信息，以保留场景元素之间的抽象空间关系，并降低计算成本。在应用特征增强之前，将低分辨率的空间注意力特征连接到最终特征集合中。低分辨率的空间注意力特征使用辅助任务进行训练，帮助学习粗略全局场景结构。与最先进的模型相比，所提出的模型在ADE20K和Cityscapes数据集上都表现出色。

URL

https://arxiv.org/abs/2402.19250

PDF

https://arxiv.org/pdf/2402.19250.pdf
Read All
Unsupervised semantic segmentation of high-resolution UAV imagery for road scene parsing

2024-02-05 13:16:12

Zihan Ma Yongshang Li Ronggui Ma Chen Liang

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Deep_Learning Represenation_Learning Knowledge Language_Model Unsupervised Pose Scene_Parsing Self-Supervised
Abstract

Two challenges are presented when parsing road scenes in UAV images. First, the high resolution of UAV images makes processing difficult. Second, supervised deep learning methods require a large amount of manual annotations to train robust and accurate models. In this paper, an unsupervised road parsing framework that leverages recent advances in vision language models and fundamental computer vision model is introduced.Initially, a vision language model is employed to efficiently process ultra-large resolution UAV images to quickly detect road regions of interest in the images. Subsequently, the vision foundation model SAM is utilized to generate masks for the road regions without category information. Following that, a self-supervised representation learning network extracts feature representations from all masked regions. Finally, an unsupervised clustering algorithm is applied to cluster these feature representations and assign IDs to each cluster. The masked regions are combined with the corresponding IDs to generate initial pseudo-labels, which initiate an iterative self-training process for regular semantic segmentation. The proposed method achieves an impressive 89.96% mIoU on the development dataset without relying on any manual annotation. Particularly noteworthy is the extraordinary flexibility of the proposed method, which even goes beyond the limitations of human-defined categories and is able to acquire knowledge of new categories from the dataset itself.

Abstract (translated)

在对UAV图像进行道路场景解析时，有两个挑战需要面对。首先，UAV图像的高分辨率使得处理过程变得困难。其次，需要大量手动注释才能训练出 robust 和 accurate 的模型，这是监督式深度学习方法的一个缺点。在本文中，介绍了一种利用最近在视觉语言模型和基本计算机视觉模型方面的进展的无需手动注释的无监督道路解析框架。首先，采用一个视觉语言模型来高效地处理超大型分辨率UAV图像，以快速检测图像中的感兴趣道路区域。接着，采用视觉基础模型（SAM）来生成没有类别信息的道路区域的掩码。然后，利用自监督表示学习网络从所有掩码区域提取特征表示。最后，采用无监督聚类算法对特征表示进行聚类，并为每个聚类分配ID。掩码区域与相应的ID结合，生成初始伪标签，从而启动自训练的语义分割过程。与任何手动注释相比，所提出的方法在开发数据集上实现了令人印象深刻的89.96% mIoU。尤其值得注意的是，所提出的方法具有非凡的灵活性，甚至超越了人类定义的范畴，并且能够从数据集中获取新的类别知识。

URL

https://arxiv.org/abs/2402.02985

PDF

https://arxiv.org/pdf/2402.02985.pdf
Read All
SAI3D: Segment Any Instance in 3D Scenes

2023-12-17 09:05:47

Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jingwei Huang, Baoquan Chen

arXiv_CV

arXiv_CV Segmentation Language_Model Transformer Zero-Shot 3D Scene_Parsing
Abstract

Advancements in 3D instance segmentation have traditionally been tethered to the availability of annotated datasets, limiting their application to a narrow spectrum of object categories. Recent efforts have sought to harness vision-language models like CLIP for open-set semantic reasoning, yet these methods struggle to distinguish between objects of the same categories and rely on specific prompts that are not universally applicable. In this paper, we introduce SAI3D, a novel zero-shot 3D instance segmentation approach that synergistically leverages geometric priors and semantic cues derived from Segment Anything Model (SAM). Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations that are consistent with the multi-view SAM masks. Moreover, we design a hierarchical region-growing algorithm with a dynamic thresholding mechanism, which largely improves the robustness of finegrained 3D scene parsing. Empirical evaluations on Scan-Net and the more challenging ScanNet++ datasets demonstrate the superiority of our approach. Notably, SAI3D outperforms existing open-vocabulary baselines and even surpasses fully-supervised methods in class-agnostic segmentation on ScanNet++.

Abstract (translated)

传统的3D实例分割进展通常与已标注的数据集的可用性相关，限制了其应用范围局限于少数物体类别。最近的努力试图利用像CLIP这样的视觉语言模型进行开放式语义推理，然而这些方法很难区分同一类别的物体，并依赖于不适用于所有任务的特定提示。在本文中，我们介绍了SAI3D，一种新颖的零散3D实例分割方法，它通过协同利用基于Segment Anything Model（SAM）生成的几何先验和语义线索来取得成功。我们的方法将3D场景分割为几何基本单元，然后将这些单元逐步合并为与多视角SAM掩码一致的3D实例分割。此外，我们还设计了一个具有动态阈值机制的分层区域生长算法，大大提高了细粒度3D场景解析的鲁棒性。在ScanNet和更具挑战性的ScanNet+ datasets上的实证评估表明，我们的方法具有优越性。值得注意的是，SAI3D在ScanNet和ScanNet+ datasets上优于现有开发生词基准，甚至超过了完全监督方法。

URL

https://arxiv.org/abs/2312.11557

PDF

https://arxiv.org/pdf/2312.11557.pdf
Read All
A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D Descriptors

2023-12-03 02:51:54

Kangcheng Liu

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Weakly_Supervised Detection Object_Detection Review Relation Knowledge Pose 3D Scene_Parsing Point_Cloud
Abstract

Existing state-of-the-art 3D point cloud understanding methods merely perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks including both segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. The first contribution is that we have done extensive methodology comparisons of traditional and learned 3D descriptors for the task of weakly supervised 3D scene understanding, and validated that our adapted traditional PFH-based 3D descriptors show excellent generalization ability across different domains. The second contribution is that we proposed a learning-based region merging strategy based on the affinity provided by both the traditional/learned 3D descriptors and learned semantics. The merging process takes both low-level geometric and high-level semantic feature correlations into consideration. Experimental results demonstrate that our framework has the best performance among the three most important weakly supervised point clouds understanding tasks including semantic segmentation, instance segmentation, and object detection even when very limited number of points are labeled. Our method, termed Region Merging 3D (RM3D), has superior performance on ScanNet data-efficient learning online benchmarks and other four large-scale 3D understanding benchmarks under various experimental settings, outperforming current arts by a margin for various 3D understanding tasks without complicated learning strategies such as active learning.

Abstract (translated)

目前最先进的3D点云理解方法仅在完全监督的方式下表现良好。据我们所知，还没有一个统一框架能够同时解决下游的高层次理解任务，尤其是当标签非常有限时。本文提出了一种通用的简单框架来解决标签有限时的点云理解问题。第一个贡献是我们对传统和学习3D描述符在弱监督3D场景理解任务上的方法进行了广泛的比较，并验证了我们自适应的传统PFH-based 3D描述符具有良好的泛化能力。第二个贡献是我们基于传统/学习3D描述符和学习语义提出了基于相似性的区域合并策略。合并过程考虑了低级几何和高级语义特征的相关性。实验结果表明，在三个最重要的弱监督点云理解任务（包括语义分割、实例分割和物体检测）中，我们的框架在标签非常有限的情况下具有最佳性能。我们的方法被称为区域合并3D（RM3D），在各种实验设置下的ScanNet数据高效学习在线基准和其他四个大型3D理解基准上具有卓越的性能，超过了没有复杂学习策略的各个3D理解任务。

URL

https://arxiv.org/abs/2312.01262

PDF

https://arxiv.org/pdf/2312.01262.pdf
Read All
A Data-efficient Framework for Robotics Large-scale LiDAR Scene Parsing

2023-12-03 02:38:51

Kangcheng Liu

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Weakly_Supervised Detection Object_Detection Relation Knowledge Optimization Unsupervised Pose Autonomous 3D Scene_Parsing Self-Supervised Reconstruction Point_Cloud Robot
Abstract

Existing state-of-the-art 3D point clouds understanding methods only perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework which simultaneously solves the downstream high-level understanding tasks, especially when labels are extremely limited. This work presents a general and simple framework to tackle point clouds understanding when labels are limited. We propose a novel unsupervised region expansion based clustering method for generating clusters. More importantly, we innovatively propose to learn to merge the over-divided clusters based on the local low-level geometric property similarities and the learned high-level feature similarities supervised by weak labels. Hence, the true weak labels guide pseudo labels merging taking both geometric and semantic feature correlations into consideration. Finally, the self-supervised reconstruction and data augmentation optimization modules are proposed to guide the propagation of labels among semantically similar points within a scene. Experimental Results demonstrate that our framework has the best performance among the three most important weakly supervised point clouds understanding tasks including semantic segmentation, instance segmentation, and object detection even when limited points are labeled, under the data-efficient settings for the large-scale 3D semantic scene parsing. The developed techniques have postentials to be applied to downstream tasks for better representations in robotic manipulation and robotic autonomous navigation. Codes and models are publicly available at: this https URL.

Abstract (translated)

目前最先进的3D点云理解方法仅在完全监督的情况下表现良好。据我们所知，还没有一个统一框架能够同时解决下游的高级理解任务，特别是在标签非常有限的情况下。本文提出了一种通用的且简单的框架来解决标签有限时的点云理解问题。我们提出了一种新颖的自监督聚类方法生成聚类。更重要的是，我们创新地提出了一种基于局部低级几何性质相似度和由弱标签学习到的较高级特征相似度的聚类方法，从而使真正的弱标签引导伪标签的合并。因此，在考虑几何和语义特征相关性的情况下，指导场景内同义点之间的标签传播。最后，我们提出了自监督重构和数据增强优化模块，以引导大规模3D语义场景解析中标签的传播。实验结果表明，在考虑有限点标签的情况下，我们的框架在包括语义分割、实例分割和目标检测的三个最重要的弱监督点云理解任务中具有最佳性能。这些开发的技术具有将应用于机器人操作和机器人自主导航下游任务的潜在优势。代码和模型公开可用，此处链接：https://this URL。

URL

https://arxiv.org/abs/2312.02208

PDF

https://arxiv.org/pdf/2312.02208.pdf
Read All
Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

2023-12-01 15:47:04

Kangcheng Liu, Yong-Jin Liu, Kai Tang, Ming Liu, Baoquan Chen

arXiv_CV

arXiv_CV Recognition Embedding Knowledge Prediction Language_Model Transformer Unsupervised Pose Few-Shot 3D Scene_Parsing Contrastive_Learning Point_Cloud
Abstract

Deep neural network models have achieved remarkable progress in 3D scene understanding while trained in the closed-set setting and with full labels. However, the major bottleneck for current 3D recognition approaches is that they do not have the capacity to recognize any unseen novel classes beyond the training categories in diverse kinds of real-world applications. In the meantime, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks, which merely perform well in a fully supervised manner. This work presents a generalized and simple framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy to extract and distill meaningful information from large-scale vision-language models, which helps benefit the open-vocabulary scene understanding tasks. To leverage the boundary information, we propose a novel energy-based loss with boundary awareness benefiting from the region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose the unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at: this https URL.

Abstract (translated)

深度神经网络模型在关闭设置和完整标签的情况下训练在3D场景理解方面取得了显著的进步。然而，当前的3D识别方法的主要瓶颈是，它们无法识别任何未在训练类别之外的新颖类别的现实世界应用。与此同时，最先进的3D场景理解方法主要需要高质量的标签来训练神经网络，而仅仅在完全监督的方式下表现良好。本文提出了一种处理有限标注场景的通用的简单框架。为了从预训练的视觉-语言模型中提取知识，我们提出了一种层次特征对齐的预训练和知识蒸馏策略，以提取和蒸馏大规模视觉-语言模型中的有意义的信息，从而帮助解决开箱见光的场景理解任务。为了利用边界信息，我们提出了一种基于能量的损失函数，其中边界感知使区域级别边界预测受益。为了鼓励潜在实例区分，并确保效率，我们提出了一种自监督的区域级别语义对比学习方案，基于神经网络的自信预测来区分中间特征嵌入的多阶段。在室内和室外场景的广泛实验中，我们的方法在数据有效的学习和开放世界少数样本学习方面都取得了显著的效果。所有代码、模型和数据都公开发布在以下这个链接上：https://this URL。

URL

https://arxiv.org/abs/2312.00663

PDF

https://arxiv.org/pdf/2312.00663.pdf
Read All
CaveSeg: Deep Semantic Segmentation and Scene Parsing for Autonomous Underwater Cave Exploration

2023-09-20 03:36:22

A. Abdullah, T. Barua, R. Tibbetts, Z. Chen, M. J. Islam, I. Rekleitis

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Transformer Pose Autonomous Scene_Parsing
Abstract

In this paper, we present CaveSeg - the first visual learning pipeline for semantic segmentation and scene parsing for AUV navigation inside underwater caves. We address the problem of scarce annotated training data by preparing a comprehensive dataset for semantic segmentation of underwater cave scenes. It contains pixel annotations for important navigation markers (e.g. caveline, arrows), obstacles (e.g. ground plain and overhead layers), scuba divers, and open areas for servoing. Through comprehensive benchmark analyses on cave systems in USA, Mexico, and Spain locations, we demonstrate that robust deep visual models can be developed based on CaveSeg for fast semantic scene parsing of underwater cave environments. In particular, we formulate a novel transformer-based model that is computationally light and offers near real-time execution in addition to achieving state-of-the-art performance. Finally, we explore the design choices and implications of semantic segmentation for visual servoing by AUVs inside underwater caves. The proposed model and benchmark dataset open up promising opportunities for future research in autonomous underwater cave exploration and mapping.

Abstract (translated)

在本文中,我们介绍了水下洞穴中的洞穴SEG(深视觉模型),它是第一个用于AUV在水下洞穴中进行语义分割和场景解析的视觉学习管道。我们为了解决缺乏标注训练数据的问题,准备了一组全面的数据集,用于水下洞穴场景的语义分割。该数据集包含对重要导航标记(例如洞穴线、箭头)像素annotations、障碍物(例如地面平原和上方层)、潜水员、以及用于控制运动的开放区域。通过对美国、墨西哥和西班牙等地的水下洞穴系统的全面基准分析,我们证明了基于洞穴SEG的强有力深度视觉模型可以用于快速水下洞穴环境中语义场景解析。特别是,我们制定了一种新型Transformer-based模型,其计算量较轻,并提供了近乎实时的执行,除了实现最先进的性能外。最后,我们探索了语义分割对AUV水下洞穴内视觉控制的设计和影响。我们提出的模型和基准数据集为未来自主水下洞穴探索和 mapping的研究打开了 promising 机会。

URL

https://arxiv.org/abs/2309.11038

PDF

https://arxiv.org/pdf/2309.11038.pdf
Read All
RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing

2023-09-19 06:32:19

Jiahang Li, Yikang Zhang, Peng Yun, Guangliang Zhou, Qijun Chen, Rui Fan

arXiv_CV

arXiv_CV CNN Detection Face Attention Prediction Transformer Scene_Parsing
Abstract

The recent advancements in deep convolutional neural networks have shown significant promise in the domain of road scene parsing. Nevertheless, the existing works focus primarily on freespace detection, with little attention given to hazardous road defects that could compromise both driving safety and comfort. In this paper, we introduce RoadFormer, a novel Transformer-based data-fusion network developed for road scene parsing. RoadFormer utilizes a duplex encoder architecture to extract heterogeneous features from both RGB images and surface normal information. The encoded features are subsequently fed into a novel heterogeneous feature synergy block for effective feature fusion and recalibration. The pixel decoder then learns multi-scale long-range dependencies from the fused and recalibrated heterogeneous features, which are subsequently processed by a Transformer decoder to produce the final semantic prediction. Additionally, we release SYN-UDTIRI, the first large-scale road scene parsing dataset that contains over 10,407 RGB images, dense depth images, and the corresponding pixel-level annotations for both freespace and road defects of different shapes and sizes. Extensive experimental evaluations conducted on our SYN-UDTIRI dataset, as well as on three public datasets, including KITTI road, CityScapes, and ORFD, demonstrate that RoadFormer outperforms all other state-of-the-art networks for road scene parsing. Specifically, RoadFormer ranks first on the KITTI road benchmark. Our source code, created dataset, and demo video are publicly available at mias.group/RoadFormer.

Abstract (translated)

最近的深度学习卷积神经网络的进步表明,在道路场景解析领域具有巨大的潜力。然而,现有的工作主要关注边界检测,而忽视了可能危及驾驶安全和舒适的危险道路缺陷。在本文中,我们介绍了路 former,这是一个专门为道路场景解析而开发的Transformer基数据融合网络。路 former利用双重编码器架构从RGB图像和表面正则信息中提取不同类型的特征。编码器特征随后被注入到一个 novel 的异质特征协同作用块,以有效地特征融合和重新校准。像素解码器随后从融合和重新校准的异质特征中学习多尺度的长期依赖关系,这些依赖关系随后由Transformer解码器处理,以产生最终的语义预测。此外,我们发布了SYN-UDTIRI,是第一个大规模的道路场景解析数据集,其中包含超过10,407张RGB图像、深度密度图像,以及不同形状和大小的自由空间和道路缺陷的像素级注释。我们对SYN-UDTIRI数据集进行了广泛的实验评估,并与其他三个公共数据集,包括KITTI道路、 CityScapes 和 ORFD,证明了路 former在道路场景解析领域比其他先进的网络表现更好。具体来说,路 former在KITTI道路基准测试中排名第一。我们的源代码、创建的数据集和演示视频均公开可在mias.group/RoadFormer网站上可用。

URL

https://arxiv.org/abs/2309.10356

PDF

https://arxiv.org/pdf/2309.10356.pdf
Read All
Semantic Segmentation on VSPW Dataset through Contrastive Loss and Multi-dataset Training Approach

2023-06-06 08:53:53

Min Yan, Qianxiong Ning, Qian Wang

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Relation Prediction Scene_Parsing
Abstract

Video scene parsing incorporates temporal information, which can enhance the consistency and accuracy of predictions compared to image scene parsing. The added temporal dimension enables a more comprehensive understanding of the scene, leading to more reliable results. This paper presents the winning solution of the CVPR2023 workshop for video semantic segmentation, focusing on enhancing Spatial-Temporal correlations with contrastive loss. We also explore the influence of multi-dataset training by utilizing a label-mapping technique. And the final result is aggregating the output of the above two models. Our approach achieves 65.95% mIoU performance on the VSPW dataset, ranked 1st place on the VSPW challenge at CVPR 2023.

Abstract (translated)

视频场景解析加入了时间信息,可以相较于图像场景解析提高预测的一致性和准确性。增加了时间维度可以实现更全面的理解场景,进而获得更加可靠的结果。本文介绍了CVPR2023年视频语义分割 workshop 中获胜的解决方案,重点研究了增强空间-时间相关性并使用对比损失的方法。此外,我们还使用标签映射技术探讨了多数据集训练的影响。最终的成果是合并了以上两个模型的输出。我们的方法在VSPW数据集上实现了65.95%的IoU表现,在CVPR2023年的VSPW挑战中排名第一。

URL

https://arxiv.org/abs/2306.03508

PDF

https://arxiv.org/pdf/2306.03508.pdf
Read All
Cross-CBAM: A Lightweight network for Scene Segmentation

2023-06-04 09:03:05

Zhengbin Zhang, Zhenhao Xu, Xingsheng Gu, Juan Xiong

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation CNN Attention Inference Pose Scene_Parsing
Abstract

Scene parsing is a great challenge for real-time semantic segmentation. Although traditional semantic segmentation networks have made remarkable leap-forwards in semantic accuracy, the performance of inference speed is unsatisfactory. Meanwhile, this progress is achieved with fairly large networks and powerful computational resources. However, it is difficult to run extremely large models on edge computing devices with limited computing power, which poses a huge challenge to the real-time semantic segmentation tasks. In this paper, we present the Cross-CBAM network, a novel lightweight network for real-time semantic segmentation. Specifically, a Squeeze-and-Excitation Atrous Spatial Pyramid Pooling Module(SE-ASPP) is proposed to get variable field-of-view and multiscale information. And we propose a Cross Convolutional Block Attention Module(CCBAM), in which a cross-multiply operation is employed in the CCBAM module to make high-level semantic information guide low-level detail information. Different from previous work, these works use attention to focus on the desired information in the backbone. CCBAM uses cross-attention for feature fusion in the FPN structure. Extensive experiments on the Cityscapes dataset and Camvid dataset demonstrate the effectiveness of the proposed Cross-CBAM model by achieving a promising trade-off between segmentation accuracy and inference speed. On the Cityscapes test set, we achieve 73.4% mIoU with a speed of 240.9FPS and 77.2% mIoU with a speed of 88.6FPS on NVIDIA GTX 1080Ti.

Abstract (translated)

场景解析是实时语义分割面临的一个巨大的挑战。虽然传统的语义分割网络在语义准确性方面已经取得了显著的进展,但推理速度仍然不满意。与此同时,这种进展是通过相对较大的网络和强大的计算资源实现的。然而,在边缘计算设备上运行巨型模型具有有限的计算能力,这给实时语义分割任务带来了一个巨大的挑战。在本文中,我们提出了Cross-CBAM网络,这是一种全新的轻量级网络,用于实时语义分割。具体来说,我们提出了一种SE-ASPPSqueeze-and-Excitation Atrous Spatial Pyramid Pooling Module(缩放并刺激刺激空间Pyramid Pooling模块),以获取可变视角和多尺度信息。我们还提出了一个Cross Convolutional Block Attention Module(CCBAM),其中在CCBAM模块中采用了交叉乘法操作,以使高层次语义信息指导低层次的细节信息。与以前的工作不同,这些工作使用注意力来关注骨架中的想要的信息。CCBAM使用交叉注意力在FPN结构中进行特征融合。在Cityscapes测试集上,我们实现了73.4%的mIoU,以240.9FPS的速度在NVIDIA GTX 1080Ti上运行,并实现了77.2%的mIoU,以88.6FPS的速度在NVIDIA GTX 1080Ti上运行。

URL

https://arxiv.org/abs/2306.02306

PDF

https://arxiv.org/pdf/2306.02306.pdf
Read All

Content

Scene_Parsing (20)

Scene_Parsing

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF Copy

Abstract

Abstract (translated)

URL

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF

PDF