Recently, large language models (LLMs) have been widely explored for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address these problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on Space3D-Bench. Moreover, Sparse3DPR obtains performance comparable to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.
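The paper's subgraph-extraction code is not reproduced here, but the idea of filtering query-irrelevant scene-graph content can be pictured with a minimal sketch: score each node against the query embedding and keep only the top-k nodes and their induced edges. All names, the cosine scoring, and the top-k rule are illustrative assumptions, not Sparse3DPR's actual algorithm.

```python
import numpy as np

def extract_task_subgraph(node_embeds, edges, query_embed, top_k=8):
    """Keep the top-k nodes most similar to the query, plus every edge whose
    endpoints both survive (illustrative sketch of query-driven filtering)."""
    node_norm = node_embeds / np.linalg.norm(node_embeds, axis=1, keepdims=True)
    scores = node_norm @ (query_embed / np.linalg.norm(query_embed))  # cosine
    keep = set(np.argsort(scores)[-top_k:].tolist())
    sub_edges = [(u, v) for (u, v) in edges if u in keep and v in keep]
    return sorted(keep), sub_edges

# Toy usage: 5 nodes with random 16-d embeddings arranged in a chain.
rng = np.random.default_rng(0)
nodes, query = rng.normal(size=(5, 16)), rng.normal(size=16)
print(extract_task_subgraph(nodes, [(0, 1), (1, 2), (2, 3), (3, 4)], query, top_k=3))
```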
https://arxiv.org/abs/2511.07813
Zero-shot object navigation (ZSON) in unseen environments remains a challenging problem for household robots, requiring strong perceptual understanding and decision-making capabilities. While recent methods leverage metric maps and Large Language Models (LLMs), they often depend on depth sensors or prebuilt maps, limiting the spatial reasoning ability of Multimodal Large Language Models (MLLMs). Mapless ZSON approaches have emerged to address this, but they typically make short-sighted decisions, leading to local deadlocks due to a lack of historical context. We propose PanoNav, a fully RGB-only, mapless ZSON framework that integrates a Panoramic Scene Parsing module to unlock the spatial parsing potential of MLLMs from panoramic RGB inputs, and a Memory-guided Decision-Making mechanism enhanced by a Dynamic Bounded Memory Queue to incorporate exploration history and avoid local deadlocks. Experiments on the public navigation benchmark show that PanoNav significantly outperforms representative baselines in both success rate (SR) and success weighted by path length (SPL).
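The Dynamic Bounded Memory Queue is described only at a high level; a minimal sketch with a fixed-capacity deque is given below. The class name, capacity, and prompt-serialization format are assumptions for illustration.

```python
from collections import deque

class BoundedMemoryQueue:
    """Fixed-capacity exploration memory: oldest entries are evicted first,
    so prompts stay bounded while retaining recent history (hedged sketch)."""
    def __init__(self, capacity=10):
        self.buffer = deque(maxlen=capacity)

    def push(self, observation_summary):
        self.buffer.append(observation_summary)

    def as_prompt_context(self):
        # Serialize history so an MLLM prompt can reference visited areas
        # and avoid the local deadlocks caused by short-sighted decisions.
        return "\n".join(f"step {i}: {s}" for i, s in enumerate(self.buffer))

memory = BoundedMemoryQueue(capacity=3)
for note in ["saw sofa, moved left", "kitchen ahead", "dead end, turned back", "hallway"]:
    memory.push(note)
print(memory.as_prompt_context())  # only the 3 most recent entries survive
```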
https://arxiv.org/abs/2511.06840
Understanding motion in dynamic environments is critical for autonomous driving, thereby motivating research on class-agnostic motion prediction. In this work, we investigate weakly and self-supervised class-agnostic motion prediction from LiDAR point clouds. Outdoor scenes typically consist of mobile foregrounds and static backgrounds, allowing motion understanding to be associated with scene parsing. Based on this observation, we propose a novel weakly supervised paradigm that replaces motion annotations with fully or partially annotated (1%, 0.1%) foreground/background masks for supervision. To this end, we develop a weakly supervised approach utilizing foreground/background cues to guide the self-supervised learning of motion prediction models. Since foreground motion generally occurs in non-ground regions, non-ground/ground masks can serve as an alternative to foreground/background masks, further reducing annotation effort. Leveraging non-ground/ground cues, we propose two additional approaches: a weakly supervised method requiring fewer (0.01%) foreground/background annotations, and a self-supervised method without annotations. Furthermore, we design a Robust Consistency-aware Chamfer Distance loss that incorporates multi-frame information and robust penalty functions to suppress outliers in self-supervised learning. Experiments show that our weakly and self-supervised models outperform existing self-supervised counterparts, and our weakly supervised models even rival some supervised ones. This demonstrates that our approaches effectively balance annotation effort and performance.
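As a rough illustration of a robust Chamfer-style objective, the numpy sketch below penalizes nearest-neighbour residuals with a Huber function so outlier matches contribute less; it covers one matching direction and a single target frame, whereas the paper's loss additionally exploits multi-frame information.

```python
import numpy as np

def huber(r, delta=0.5):
    """Robust penalty: quadratic near zero, linear in the tails."""
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def robust_chamfer(pred_pts, target_pts, delta=0.5):
    """One direction of a robust Chamfer term: each predicted point is matched
    to its nearest target point; Huber suppresses spurious matches."""
    d = np.linalg.norm(pred_pts[:, None, :] - target_pts[None, :, :], axis=-1)
    return huber(d.min(axis=1), delta).mean()

# Warp points by a predicted flow and compare against the next frame.
rng = np.random.default_rng(1)
p0 = rng.normal(size=(100, 3))
flow = np.tile([0.1, 0.0, 0.0], (100, 1))       # predicted per-point motion
p1 = p0 + np.array([0.1, 0.0, 0.0])             # observed next frame
print(robust_chamfer(p0 + flow, p1))            # ~0 for a correct prediction
```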
https://arxiv.org/abs/2509.13116
Accurate segmentation of thin structures is critical for microsurgical scene understanding but remains challenging due to resolution loss, low contrast, and class imbalance. We propose Microsurgery Instrument Segmentation for Robotic Assistance (MISRA), a segmentation framework that augments RGB input with luminance channels, integrates skip attention to preserve elongated features, and employs an Iterative Feedback Module (IFM) to restore continuity across multiple passes. In addition, we introduce a dedicated microsurgical dataset with fine-grained annotations of surgical instruments, including thin objects, providing a benchmark for robust evaluation. The dataset is available at this https URL. Experiments demonstrate that MISRA achieves competitive performance, improving the mean class IoU by 5.37% over competing methods, while delivering more stable predictions at instrument contacts and overlaps. These results position MISRA as a promising step toward reliable scene parsing for computer-assisted and robotic microsurgery.
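Augmenting the RGB input with a luminance channel can be sketched as below, assuming an ITU-R BT.601 luma; MISRA's exact channel construction may differ.

```python
import numpy as np

def add_luminance_channel(rgb):
    """Append a BT.601 luma channel to an (H, W, 3) RGB image, giving the
    network an explicit brightness cue for thin, low-contrast structures."""
    luma = rgb @ np.array([0.299, 0.587, 0.114])   # weighted sum over R, G, B
    return np.concatenate([rgb, luma[..., None]], axis=-1)

img = np.random.rand(8, 8, 3)
print(add_luminance_channel(img).shape)  # (8, 8, 4)
```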
https://arxiv.org/abs/2509.11727
Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes. In this survey, we present a holistic review of recent advances in VSP, covering a wide array of vision tasks, including Video Semantic Segmentation (VSS), Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), as well as Video Tracking and Segmentation (VTS), and Open-Vocabulary Video Segmentation (OVVS). We systematically analyze the evolution from traditional hand-crafted features to modern deep learning paradigms -- spanning from fully convolutional networks to the latest transformer-based architectures -- and assess their effectiveness in capturing both local and global temporal contexts. Furthermore, our review critically discusses the technical challenges, ranging from maintaining temporal consistency to handling complex scene dynamics, and offers a comprehensive comparative study of datasets and evaluation metrics that have shaped current benchmarking standards. By distilling the key contributions and shortcomings of state-of-the-art methodologies, this survey highlights emerging trends and prospective research directions that promise to further elevate the robustness and adaptability of VSP in real-world applications.
https://arxiv.org/abs/2506.13552
RGB-D scene parsing methods effectively capture both semantic and geometric features of the environment, demonstrating great potential under challenging conditions such as extreme weather and low lighting. However, existing RGB-D scene parsing methods predominantly rely on supervised training strategies, which require large amounts of manually annotated pixel-level labels that are time-consuming and costly to produce. To overcome these limitations, we introduce DepthMatch, a semi-supervised learning framework specifically designed for RGB-D scene parsing. To make full use of unlabeled data, we propose a complementary patch mix-up augmentation that explores the latent relationships between texture and spatial features in RGB-D image pairs. We also design a lightweight spatial prior injector to replace traditional complex fusion modules, improving the efficiency of heterogeneous feature fusion. Furthermore, we introduce a depth-guided boundary loss to enhance the model's boundary prediction capabilities. Experimental results demonstrate that DepthMatch exhibits high applicability in both indoor and outdoor scenes, achieving state-of-the-art results on the NYUv2 dataset and ranking first on the KITTI Semantics benchmark.
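What complementary patch mix-up between two unlabeled RGB-D pairs might look like is sketched below: a random patch grid swaps RGB content from one sample into another while the complementary grid is applied to the depth maps. The grid size, swap probability, and exact complementarity rule are assumptions, not DepthMatch's published recipe.

```python
import numpy as np

def complementary_patch_mixup(rgb_a, depth_a, rgb_b, depth_b, patch=16, p=0.5):
    """CutMix-style RGB-D augmentation with complementary masks for the
    texture (RGB) and spatial (depth) modalities -- illustrative sketch."""
    h, w, _ = rgb_a.shape
    grid = np.random.rand(h // patch, w // patch) < p
    mask = grid.repeat(patch, axis=0).repeat(patch, axis=1)
    rgb_mix = np.where(mask[..., None], rgb_b, rgb_a)
    depth_mix = np.where(~mask, depth_b, depth_a)   # complementary grid
    return rgb_mix, depth_mix

rgb_a, rgb_b = np.random.rand(2, 64, 64, 3)
depth_a, depth_b = np.random.rand(2, 64, 64)
rgb_mix, depth_mix = complementary_patch_mixup(rgb_a, depth_a, rgb_b, depth_b)
print(rgb_mix.shape, depth_mix.shape)
```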
https://arxiv.org/abs/2505.20041
Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, attributes, and relations. Therefore, we propose Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation of complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that builds an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, a hierarchical compositional diffusion stage utilizes a Gaussian mask and filtering to refine bounding-box regions and enhance objects through region enhancement, resulting in accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that MCCD significantly improves the performance of baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
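The Gaussian mask used to refine bounding-box regions can plausibly be pictured as a soft spatial weight peaked at the box centre, which a region-enhancement step could use to upweight denoising inside the box. The following is a generic sketch; the sigma scaling and how MCCD actually applies the mask are assumptions.

```python
import numpy as np

def gaussian_box_mask(h, w, box, sigma_scale=0.25):
    """Soft weight map peaked at the centre of box = (x0, y0, x1, y1),
    decaying smoothly toward the box edges (illustrative sketch)."""
    x0, y0, x1, y1 = box
    cy, cx = (y0 + y1) / 2, (x0 + x1) / 2
    sy, sx = max((y1 - y0) * sigma_scale, 1e-6), max((x1 - x0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2) / 2)

mask = gaussian_box_mask(64, 64, box=(10, 20, 40, 50))
print(mask.shape, round(float(mask.max()), 3))  # peak near the box centre
```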
https://arxiv.org/abs/2505.02648
Recent vision foundation models (VFMs), typically based on the Vision Transformer (ViT), have significantly advanced numerous computer vision tasks. Despite their success in tasks focused solely on RGB images, the potential of VFMs in RGB-depth driving scene parsing remains largely under-explored. In this article, we take one step toward this emerging research area by investigating a feasible technique to fully exploit VFMs for generalizable RGB-depth driving scene parsing. Specifically, we explore the inherent characteristics of RGB and depth data, thereby presenting a Heterogeneous Feature Integration Transformer (HFIT). This network enables the efficient extraction and integration of comprehensive heterogeneous features without re-training ViTs. Relative depth predictions from VFMs, used as inputs to the HFIT side adapter, overcome the dependence on depth maps. Our proposed HFIT demonstrates superior performance compared to all other traditional single-modal and data-fusion scene parsing networks, pre-trained VFMs, and ViT adapters on the Cityscapes and KITTI Semantics datasets. We believe this novel strategy paves the way for future innovations in VFM-based data-fusion techniques for driving scene parsing. Our source code is publicly available at this https URL.
https://arxiv.org/abs/2502.06219
Multi-object multi-part scene segmentation is a challenging task whose complexity scales exponentially with part granularity and number of scene objects. To address the task, we propose a plug-and-play approach termed OLAF. First, we augment the input (RGB) with channels containing object-based structural cues (fg/bg mask, boundary edge mask). We propose a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization. In addition, we introduce an encoder module termed LDF to provide low-level dense feature guidance. This assists segmentation, particularly for smaller parts. OLAF enables significant mIoU gains of $\mathbf{3.3}$ (Pascal-Parts-58), $\mathbf{3.5}$ (Pascal-Parts-108) over the SOTA model. On the most challenging variant (Pascal-Parts-201), the gain is $\mathbf{4.0}$. Experimentally, we show that OLAF's broad applicability enables gains across multiple architectures (CNN, U-Net, Transformer) and datasets. The code is available at this http URL
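Weight adaptation for a 5-channel input is commonly done by inflating the first convolution's RGB filters. The sketch below shows one standard recipe (new channels initialized from the mean RGB filter, all weights rescaled to keep pre-activation magnitudes comparable); OLAF's exact adaptation technique may differ.

```python
import torch
import torch.nn as nn

def inflate_first_conv(conv3: nn.Conv2d, extra_channels: int = 2) -> nn.Conv2d:
    """Extend an RGB-pretrained first conv to RGB + auxiliary channels
    (e.g. fg/bg mask and boundary-edge mask) without destabilizing training."""
    in_ch = conv3.in_channels + extra_channels
    conv5 = nn.Conv2d(in_ch, conv3.out_channels, conv3.kernel_size,
                      conv3.stride, conv3.padding, bias=conv3.bias is not None)
    with torch.no_grad():
        mean_w = conv3.weight.mean(dim=1, keepdim=True)          # (out, 1, k, k)
        new_w = torch.cat([conv3.weight,
                           mean_w.repeat(1, extra_channels, 1, 1)], dim=1)
        conv5.weight.copy_(new_w * conv3.in_channels / in_ch)    # rescale
        if conv3.bias is not None:
            conv5.bias.copy_(conv3.bias)
    return conv5

conv5 = inflate_first_conv(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3))
print(conv5(torch.randn(1, 5, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
```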
https://arxiv.org/abs/2411.02858
Task-specific data-fusion networks have marked considerable achievements in urban scene parsing. Among these networks, our recently proposed RoadFormer successfully extracts heterogeneous features from RGB images and surface normal maps and fuses these features through attention mechanisms, demonstrating compelling efficacy in RGB-Normal road scene parsing. However, its performance significantly deteriorates when handling other types/sources of data or performing more universal, all-category scene parsing tasks. To overcome these limitations, this study introduces RoadFormer+, an efficient, robust, and adaptable model capable of effectively fusing RGB-X data, where ``X'' represents additional types/modalities of data such as depth, thermal, surface normal, and polarization. Specifically, we propose a novel hybrid feature decoupling encoder to extract heterogeneous features and decouple them into global and local components. These decoupled features are then fused through a dual-branch multi-scale heterogeneous feature fusion block, which employs parallel Transformer attentions and convolutional neural network modules to merge multi-scale features across different scales and receptive fields. The fused features are subsequently fed into a decoder to generate the final semantic predictions. Notably, our proposed RoadFormer+ ranks first on the KITTI Road benchmark and achieves state-of-the-art performance in mean intersection over union on the Cityscapes, MFNet, FMB, and ZJU datasets. Moreover, it reduces the number of learnable parameters by 65\% compared to RoadFormer. Our source code will be publicly available at mias.group/RoadFormerPlus.
https://arxiv.org/abs/2407.21631
Existing contrastive learning methods mainly focus on single-grained representation learning, e.g., part-level, object-level or scene-level ones, thus inevitably neglecting the transferability of representations to other granularity levels. In this paper, we aim to learn multi-grained representations, which can effectively describe an image at various granularity levels, thus improving generalization on extensive downstream tasks. To this end, we propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning. Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast over these correspondences to learn more general unsupervised representations. Without pretraining on a large-scale dataset, our method significantly outperforms existing state-of-the-art methods on extensive downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation and keypoint detection. Moreover, experimental results support the data-efficient property and excellent representation transferability of our method. The source code and trained weights are available at \url{this https URL}.
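A loose schematic of multi-grained contrast: pool the feature maps of two augmented views to several grid granularities and apply InfoNCE at each scale. MGC's actual correspondence construction between positive views is considerably richer; the grids, temperature, and pooling here are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, tau=0.2):
    """Standard InfoNCE between row-matched embeddings q and k."""
    logits = F.normalize(q, dim=1) @ F.normalize(k, dim=1).T / tau
    return F.cross_entropy(logits, torch.arange(len(q)))

def pooled(feat, g):
    """Average-pool a (B, C, H, W) map to a g x g grid, one row per cell."""
    p = F.adaptive_avg_pool2d(feat, g)
    return p.flatten(2).transpose(1, 2).reshape(-1, feat.size(1))

def multi_grained_contrast(feat_a, feat_b, grids=(1, 2, 4)):
    """Contrast two views at image-, region- and part-like granularities."""
    losses = [info_nce(pooled(feat_a, g), pooled(feat_b, g)) for g in grids]
    return sum(losses) / len(losses)

feat_a, feat_b = torch.randn(2, 4, 128, 16, 16)   # features of two views
print(float(multi_grained_contrast(feat_a, feat_b)))
```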
https://arxiv.org/abs/2407.02014
The third Pixel-level Video Understanding in the Wild challenge (PVUW CVPR 2024) aims to advance the state of the art in video understanding by benchmarking Video Panoptic Segmentation (VPS) and Video Semantic Segmentation (VSS) on the challenging videos and scenes introduced in the large-scale Video Panoptic Segmentation in the Wild (VIPSeg) test set and the large-scale Video Scene Parsing in the Wild (VSPW) test set, respectively. This paper details the work that won 1st place in the PVUW'24 VPS challenge, establishing state-of-the-art results in all metrics, including Video Panoptic Quality (VPQ) and Segmentation and Tracking Quality (STQ). With minor fine-tuning, our approach also achieved 3rd place in the PVUW'24 VSS challenge as ranked by the mIoU (mean intersection over union) metric, and 1st place as ranked by the VC16 (16-frame video consistency) metric. Our winning solution stands on the shoulders of the giant foundational vision transformer model (DINOv2 ViT-g) and the proven multi-stage Decoupled Video Instance Segmentation (DVIS) framework for video understanding.
https://arxiv.org/abs/2406.05352
Radar sensors are low cost, long-range, and weather-resilient. Therefore, they are widely used for driver assistance functions, and are expected to be crucial for the success of autonomous driving in the future. In many perception tasks only pre-processed radar point clouds are considered. In contrast, radar spectra are a raw form of radar measurements and contain more information than radar point clouds. However, radar spectra are rather difficult to interpret. In this work, we aim to explore the semantic information contained in spectra in the context of automated driving, thereby moving towards better interpretability of radar spectra. To this end, we create a radar spectra-language model, allowing us to query radar spectra measurements for the presence of scene elements using free text. We overcome the scarcity of radar spectra data by matching the embedding space of an existing vision-language model (VLM). Finally, we explore the benefit of the learned representation for scene parsing, and obtain improvements in free space segmentation and object detection merely by injecting the spectra embedding into a baseline model.
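Matching the embedding space of a frozen VLM could be trained roughly as below: a small radar-spectrum encoder is regressed onto the VLM's embedding of the co-recorded camera image, so free-text queries keep working through the VLM's text tower. The encoder architecture, embedding dimension, and cosine objective are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RadarSpectraEncoder(nn.Module):
    """Tiny stand-in encoder mapping a radar spectrum into a VLM's
    embedding space (hypothetical architecture for illustration)."""
    def __init__(self, in_ch=1, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim))

    def forward(self, spectrum):
        return F.normalize(self.net(spectrum), dim=-1)

encoder = RadarSpectraEncoder()
spectrum = torch.randn(4, 1, 128, 128)                 # batch of radar spectra
vlm_embed = F.normalize(torch.randn(4, 512), dim=-1)   # frozen VLM image embeddings
loss = 1 - F.cosine_similarity(encoder(spectrum), vlm_embed).mean()
loss.backward()
print(float(loss))
```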
https://arxiv.org/abs/2406.02158
Pixel-level scene understanding is one of the fundamental problems in computer vision, which aims at recognizing the object class, mask, and semantics of each pixel in a given image. Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction, because the real world is video-based rather than static. In this paper, we adopt a semi-supervised video semantic segmentation method based on unreliable pseudo labels. We then ensemble the teacher network with the student network to generate pseudo labels and retrain the student network. Our method achieves mIoU scores of 63.71% and 67.83% on the development test and final test, respectively. Finally, we obtained 1st place in the Video Scene Parsing in the Wild Challenge at CVPR 2024.
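Teacher-student pseudo-label ensembling can be sketched as follows: average the two networks' predictions and keep only confident pixels as retraining targets, marking the unreliable rest as ignore. The confidence threshold, averaging weight, and ignore-index handling are assumptions.

```python
import torch

@torch.no_grad()
def ensemble_pseudo_labels(teacher_logits, student_logits, conf_thresh=0.9,
                           alpha=0.5, ignore_index=255):
    """Average teacher/student class probabilities and keep only confident
    pixels as pseudo labels for retraining the student (hedged sketch)."""
    probs = alpha * teacher_logits.softmax(1) + (1 - alpha) * student_logits.softmax(1)
    conf, labels = probs.max(dim=1)
    labels[conf < conf_thresh] = ignore_index   # unreliable pixels are ignored
    return labels                               # (B, H, W) retraining targets

teacher = torch.randn(2, 19, 64, 64)            # arbitrary class count for the sketch
student = torch.randn(2, 19, 64, 64)
pseudo = ensemble_pseudo_labels(teacher, student)
print(pseudo.shape, int((pseudo == 255).sum()))
```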
https://arxiv.org/abs/2406.00587
Advancements in machine learning, computer vision, and robotics have paved the way for transformative solutions in various domains, particularly in agriculture. For example, accurate identification and segmentation of fruits from field images play a crucial role in automating jobs such as harvesting, disease detection, and yield estimation. However, achieving robust and precise in-field fruit segmentation remains a challenging task, since large amounts of labeled data are required to handle variations in fruit size, shape, color, and occlusion. In this paper, we develop a few-shot semantic segmentation framework for in-field fruits using transfer learning. Concretely, our work is aimed at agricultural domains that lack publicly available labeled data. Motivated by similar success in urban scene parsing, we propose specialized pre-training on a public benchmark dataset for fruit transfer learning. By leveraging pre-trained neural networks, accurate semantic segmentation of fruit in the field is achieved with only a few labeled images. Furthermore, we show that models with pre-training learn to distinguish between fruit still on the trees and fruit that has fallen to the ground, and that they effectively transfer this knowledge to the target fruit dataset.
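A minimal few-shot transfer sketch in PyTorch/torchvision: start from a segmentation model pre-trained on a public benchmark, swap the classifier head for the fruit classes, and fine-tune only the head on a handful of labeled field images. The model choice, the three-class set (background / fruit on tree / fallen fruit), and the head-only policy are illustrative assumptions; pre-trained weights download on first use.

```python
import torch
import torchvision

# Load a benchmark-pretrained segmentation model and freeze its features.
model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
for p in model.parameters():
    p.requires_grad = False
# Replace the final classifier conv: background / fruit-on-tree / fallen fruit.
model.classifier[4] = torch.nn.Conv2d(256, 3, kernel_size=1)
optim = torch.optim.Adam(model.classifier[4].parameters(), lr=1e-3)

images = torch.randn(2, 3, 256, 256)     # stand-ins for the few labeled images
labels = torch.randint(0, 3, (2, 256, 256))
loss = torch.nn.functional.cross_entropy(model(images)["out"], labels)
loss.backward()
optim.step()
print(float(loss))
```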
https://arxiv.org/abs/2405.02556
We propose a system for visual scene analysis and recognition based on encoding the sparse, latent feature-representation of an image into a high-dimensional vector that is subsequently factorized to parse scene content. The sparse feature representation is learned from image statistics via convolutional sparse coding, while scene parsing is performed by a resonator network. The integration of sparse coding with the resonator network increases the capacity of distributed representations and reduces collisions in the combinatorial search space during factorization. We find that for this problem the resonator network is capable of fast and accurate vector factorization, and we develop a confidence-based metric that assists in tracking the convergence of the resonator network.
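The resonator factorization itself is concrete enough to sketch: with bipolar vectors and Hadamard binding, each factor estimate is alternately unbound from the composite and cleaned up against its codebook, and a normalized-similarity score provides the convergence signal. The confidence definition below is an assumption; the paper's metric may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 1024, 12                                   # vector dimension, codebook size
A, B, C = (rng.choice([-1, 1], size=(K, D)) for _ in range(3))

ia, ib, ic = 3, 7, 5
s = A[ia] * B[ib] * C[ic]                         # bind one code per factor

def cleanup(codebook, v):
    """Project v onto the codebook span and squash to bipolar; also return
    a confidence score in [0, 1] (max normalized similarity)."""
    sims = codebook @ v
    est = np.sign(codebook.T @ sims)
    return est, np.abs(sims).max() / (np.linalg.norm(v) * np.sqrt(len(v)))

# Initialize every estimate as the superposition of its whole codebook.
a_hat = np.where(A.sum(0) >= 0, 1, -1)
b_hat = np.where(B.sum(0) >= 0, 1, -1)
c_hat = np.where(C.sum(0) >= 0, 1, -1)
for step in range(50):
    a_hat, ca = cleanup(A, s * b_hat * c_hat)     # bipolar unbinding = re-multiplying
    b_hat, cb = cleanup(B, s * a_hat * c_hat)
    c_hat, cc = cleanup(C, s * a_hat * b_hat)
    if min(ca, cb, cc) > 0.9:                     # confidence-based stopping
        break

# Should recover the factor indices 3 7 5 within a few iterations.
print(step, np.argmax(A @ a_hat), np.argmax(B @ b_hat), np.argmax(C @ c_hat))
```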
https://arxiv.org/abs/2404.19126
Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, the majority of existing studies have relied on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between RGB and thermal modalities. Recent progress in vision foundation models (VFMs) trained through self-supervision on vast amounts of unlabeled data has proven their ability to extract informative, general-purpose features. However, this potential has yet to be fully leveraged in the domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we delve deeper into the unique characteristics of RGB and thermal modalities, thereby designing a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network. This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner. Moreover, we introduce an auxiliary task to further enrich the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, equipped with all these components, demonstrates superior performance compared to all other state-of-the-art RGB-thermal scene parsing networks, achieving top ranks across three widely used public RGB-thermal scene parsing datasets. We believe this new paradigm has opened up new opportunities for future developments in data-fusion scene parsing approaches.
https://arxiv.org/abs/2404.03527
The complexity of scene parsing grows with the number of object and scene classes, which is higher in unrestricted open scenes. The biggest challenge is to model the spatial relation between scene elements while succeeding in identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes the attention weights for each level of representation to generate the final class labels. A novel `channel attention module' is designed to compute the attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution to preserve the abstract spatial relationships among scene elements and reduce computation cost. Spatial attention is subsequently concatenated into a final feature set before applying feature boosting. Low-resolution spatial attention features are trained using an auxiliary task that helps learning a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
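One minimal reading of the channel attention module: pool each level's feature map to a global descriptor, predict a softmax weight per level, and fuse the levels by their weights, boosting relevant extraction stages and attenuating the rest. The real design likely differs; levels are assumed already resampled to a common resolution and channel count.

```python
import torch
import torch.nn as nn

class FeatureBoosting(nn.Module):
    """Weight same-shape multi-level feature maps with learned per-level
    attention and sum them (a schematic sketch, not the paper's module)."""
    def __init__(self, channels, num_levels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels * num_levels, num_levels), nn.Softmax(dim=-1))

    def forward(self, feats):                     # list of (B, C, H, W) maps
        descs = torch.cat([self.pool(f).flatten(1) for f in feats], dim=1)
        w = self.mlp(descs)                       # (B, num_levels) attention
        return sum(w[:, i, None, None, None] * f for i, f in enumerate(feats))

feats = [torch.randn(2, 64, 32, 32) for _ in range(4)]
print(FeatureBoosting(64, 4)(feats).shape)        # torch.Size([2, 64, 32, 32])
```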
https://arxiv.org/abs/2402.19250
Parsing road scenes in UAV images presents two challenges. First, the high resolution of UAV images makes processing difficult. Second, supervised deep learning methods require a large amount of manual annotation to train robust and accurate models. In this paper, an unsupervised road parsing framework that leverages recent advances in vision-language models and vision foundation models is introduced. Initially, a vision-language model is employed to efficiently process ultra-high-resolution UAV images and quickly detect road regions of interest. Subsequently, the vision foundation model SAM is utilized to generate masks for the road regions without category information. A self-supervised representation learning network then extracts feature representations from all masked regions. Finally, an unsupervised clustering algorithm clusters these feature representations and assigns an ID to each cluster. The masked regions are combined with the corresponding IDs to generate initial pseudo-labels, which initiate an iterative self-training process for regular semantic segmentation. The proposed method achieves an impressive 89.96% mIoU on the development dataset without relying on any manual annotation. Particularly noteworthy is the extraordinary flexibility of the proposed method, which goes beyond the limitations of human-defined categories and is able to acquire knowledge of new categories from the dataset itself.
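The clustering-to-pseudo-label step can be condensed as below, assuming SAM masks and self-supervised per-mask feature vectors have been computed upstream; the cluster count and the 255 ignore label are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def masks_to_pseudo_labels(mask_features, masks, image_shape, n_clusters=4):
    """Cluster per-mask features and paint each mask with its cluster ID to
    form an initial pseudo-label map for iterative self-training (sketch)."""
    ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(mask_features)
    label_map = np.full(image_shape, 255, dtype=np.uint8)   # 255 = unlabeled
    for mask, cid in zip(masks, ids):
        label_map[mask] = cid
    return label_map

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 32))                        # one vector per mask
masks = [rng.random((64, 64)) > 0.9 for _ in range(10)]  # boolean SAM-style masks
print(np.unique(masks_to_pseudo_labels(feats, masks, (64, 64))))
```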
https://arxiv.org/abs/2402.02985
Advancements in 3D instance segmentation have traditionally been tethered to the availability of annotated datasets, limiting their application to a narrow spectrum of object categories. Recent efforts have sought to harness vision-language models like CLIP for open-set semantic reasoning, yet these methods struggle to distinguish between objects of the same category and rely on specific prompts that are not universally applicable. In this paper, we introduce SAI3D, a novel zero-shot 3D instance segmentation approach that synergistically leverages geometric priors and semantic cues derived from the Segment Anything Model (SAM). Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations that are consistent with the multi-view SAM masks. Moreover, we design a hierarchical region-growing algorithm with a dynamic thresholding mechanism, which largely improves the robustness of fine-grained 3D scene parsing. Empirical evaluations on ScanNet and the more challenging ScanNet++ datasets demonstrate the superiority of our approach. Notably, SAI3D outperforms existing open-vocabulary baselines and even surpasses fully-supervised methods in class-agnostic segmentation on ScanNet++.
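Hierarchical region growing with a relaxing threshold can be illustrated on a toy affinity matrix: primitives merge whenever their pairwise affinity exceeds a threshold that loosens round by round. This naive union-style merge is only a schematic of the idea; SAI3D operates on 3D primitives with affinities derived from multi-view SAM masks.

```python
import numpy as np

def hierarchical_region_growing(affinity, thresholds=(0.9, 0.8, 0.7)):
    """Merge primitives whose affinity exceeds a per-round threshold that
    relaxes over rounds (schematic of dynamic-threshold region growing)."""
    n = affinity.shape[0]
    labels = np.arange(n)                 # each primitive starts alone
    for t in thresholds:                  # progressively looser merging rounds
        for i in range(n):
            for j in range(i + 1, n):
                if affinity[i, j] >= t and labels[i] != labels[j]:
                    labels[labels == labels[j]] = labels[i]   # union the groups
    return labels

aff = np.array([[1.0, 0.95, 0.2],
                [0.95, 1.0, 0.75],
                [0.2, 0.75, 1.0]])
print(hierarchical_region_growing(aff))   # all three primitives end up merged
```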
https://arxiv.org/abs/2312.11557