To continuously enhance model adaptability in surgical video scene parsing, recent studies incrementally update the model so that it progressively learns to segment an increasing number of surgical instruments over time. However, prior works have consistently overlooked the potential of positive forward knowledge transfer, i.e., how past knowledge could help learn new classes, and positive backward knowledge transfer, i.e., how learning new classes could help refine past knowledge. In this paper, we propose a self-reflection hierarchical prompt framework that unlocks the power of positive forward and backward knowledge transfer in class-incremental segmentation, aiming to proficiently learn new instruments, improve existing skills on regular instruments, and avoid catastrophic forgetting of old instruments. Our framework is built on a frozen, pre-trained model that adaptively appends instrument-aware prompts for new classes throughout training episodes. To enable positive forward knowledge transfer, we organize instrument prompts into a hierarchical prompt parsing tree, with the instrument-shared prompt partition as the root node, n-part-shared prompt partitions as intermediate nodes, and instrument-distinct prompt partitions as leaf nodes, exposing reusable historical knowledge that simplifies the learning of new classes. Conversely, to encourage positive backward knowledge transfer, we conduct self-reflective refinement of existing knowledge via directed weighted-graph propagation, examining the knowledge associations recorded in the tree to improve its representativeness without causing catastrophic forgetting. Our framework is applicable to both CNN-based models and advanced transformer-based foundation models, yielding improvements of more than 5% and 11% over competing methods on two public benchmarks, respectively.
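To make the prompt-tree idea concrete, below is a minimal sketch of how instrument prompts could be arranged into such a hierarchy, with a shared root partition, reusable part-shared intermediate partitions, and instrument-distinct leaves; all class and parameter names (PromptNode, HierarchicalPromptTree, tokens_per_node) are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of a hierarchical prompt parsing tree (names and sizes are
# illustrative, not the authors' implementation).
import torch
import torch.nn as nn


class PromptNode(nn.Module):
    """A node holding one learnable prompt partition."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.children = nn.ModuleDict()  # child name -> PromptNode


class HierarchicalPromptTree(nn.Module):
    """Root: instrument-shared prompts; intermediate: part-shared; leaves: instrument-distinct."""

    def __init__(self, dim: int = 256, tokens_per_node: int = 4):
        super().__init__()
        self.dim, self.tokens = dim, tokens_per_node
        self.root = PromptNode(tokens_per_node, dim)
        self.leaf_path = {}  # class name -> list of node names from root to leaf

    def add_class(self, cls: str, shared_parts: list[str]):
        """Append a new class: reuse existing part-shared nodes, create a fresh leaf."""
        node, path = self.root, []
        for part in shared_parts:                      # reuse or create intermediate nodes
            if part not in node.children:
                node.children[part] = PromptNode(self.tokens, self.dim)
            node = node.children[part]
            path.append(part)
        node.children[cls] = PromptNode(self.tokens, self.dim)  # instrument-distinct leaf
        self.leaf_path[cls] = path + [cls]

    def prompts_for(self, cls: str) -> torch.Tensor:
        """Concatenate prompt partitions along the root-to-leaf path for one class."""
        node, chunks = self.root, [self.root.prompts]
        for name in self.leaf_path[cls]:
            node = node.children[name]
            chunks.append(node.prompts)
        return torch.cat(chunks, dim=0)               # (num_path_nodes * tokens, dim)
```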
https://arxiv.org/abs/2604.02877
We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. In addition, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
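As a rough illustration of window-based causal attention for streaming, the snippet below builds a block-causal mask in which each frame attends only to itself and a fixed number of preceding frames, bounding memory; the window size, token grouping, and mask convention are assumptions, not SLARM's exact design.

```python
# Sketch of a window-based causal attention mask for streaming frames
# (window size and mask convention are illustrative assumptions).
import torch


def window_causal_mask(num_frames: int, tokens_per_frame: int, window: int) -> torch.Tensor:
    """True entries are allowed attention pairs: each frame attends only to
    itself and the previous `window - 1` frames (causal, bounded memory)."""
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    q, k = frame_idx[:, None], frame_idx[None, :]
    return (k <= q) & (k > q - window)


mask = window_causal_mask(num_frames=6, tokens_per_frame=2, window=3)
# Convert to an additive bias usable by torch.nn.functional.scaled_dot_product_attention.
attn_bias = torch.zeros(mask.shape).masked_fill(~mask, float("-inf"))
```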
https://arxiv.org/abs/2603.22893
Driving scene parsing is critical for autonomous vehicles to operate reliably in complex real-world traffic environments. To reduce the reliance on costly pixel-level annotations, synthetic datasets with automatically generated labels have become a popular alternative. However, models trained on synthetic data often perform poorly when applied to real-world scenes due to the synthetic-to-real domain gap. Despite the success of unsupervised domain adaptation in narrowing this gap, most existing methods mainly focus on global feature alignment while overlooking the semantic structure of the feature space. As a result, semantic relations among classes are insufficiently modeled, limiting the model's ability to generalize. To address these challenges, this study introduces a novel unsupervised domain adaptation framework that explicitly regularizes semantic feature structures to significantly enhance driving scene parsing performance in real-world scenarios. Specifically, the proposed method enforces inter-class separation and intra-class compactness by leveraging class-specific prototypes, thereby enhancing the discriminability and structural coherence of feature clusters. An entropy-based noise filtering strategy improves the reliability of pseudo labels, while a pixel-level attention mechanism further refines feature alignment. Extensive experiments on representative benchmarks demonstrate that the proposed method consistently outperforms recent state-of-the-art methods. These results underscore the importance of preserving semantic structure for robust synthetic-to-real adaptation in driving scene parsing tasks.
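A minimal sketch of the prototype-based regularization described above, assuming class prototypes are mean feature vectors and that compactness and separation are enforced with cosine similarities and a margin; the exact losses, weights, and pseudo-label handling in the paper may differ.

```python
# Sketch of a prototype-based structural regularizer (intra-class compactness
# plus inter-class separation); the margin and weighting are illustrative, and
# at least one class is assumed to be present in the batch.
import torch
import torch.nn.functional as F


def prototype_structure_loss(feats, labels, num_classes, margin=0.5):
    """feats: (N, D) pixel features; labels: (N,) ground-truth or pseudo class ids."""
    feats = F.normalize(feats, dim=1)
    protos, compact = [], feats.new_tensor(0.0)
    for c in range(num_classes):
        m = labels == c
        if m.any():
            p = F.normalize(feats[m].mean(0), dim=0)                 # class prototype
            protos.append(p)
            compact = compact + (1 - (feats[m] * p).sum(1)).mean()   # pull pixels toward prototype
    protos = torch.stack(protos)                                      # (C', D)
    sim = protos @ protos.t()                                         # pairwise prototype similarity
    off_diag = sim - torch.eye(len(protos), device=sim.device)
    separate = F.relu(off_diag - margin).mean()                       # push prototypes apart beyond margin
    return compact / max(len(protos), 1) + separate
```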
https://arxiv.org/abs/2603.16083
Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, their reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, the proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. Experiments on ScanRefer and EmbodiedScan show that UniGround achieves 46.1\%/34.1\% Acc@0.25/0.5 on ScanRefer and 28.7\% Acc@0.25 on EmbodiedScan, establishing a new state-of-the-art among zero-shot methods on EmbodiedScan without any 3D supervision. We further evaluate UniGround in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing training-free reasoning generalizes robustly beyond curated benchmarks.
https://arxiv.org/abs/2603.08131
Inspired by the human visual system, which operates on two parallel yet interactive streams for contextual and spatial understanding, this article presents Two Interactive Streams (TwInS), a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. TwInS adopts a unified, general-purpose architecture in which multi-level contextual features from the scene parsing stream are infused into the geometric vision stream to guide its iterative refinement. In the reverse direction, decoded geometric features are projected into the contextual feature space for selective heterogeneous feature fusion via a novel cross-task adapter, which leverages rich cross-view geometric cues to enhance scene parsing. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is further equipped with a tailored semi-supervised training strategy, which unleashes the potential of large-scale multi-view data and enables continuous self-evolution without requiring ground-truth correspondences. Extensive experiments conducted on three public datasets validate the effectiveness of TwInS's core components and demonstrate its superior performance over existing state-of-the-art approaches. The source code will be made publicly available upon publication.
https://arxiv.org/abs/2602.13588
High-precision scene parsing tasks, including image matting and dichotomous segmentation, aim to accurately predict masks with extremely fine details (such as hair). Most existing methods focus on salient, single foreground objects. While interactive methods allow for target adjustment, their class-agnostic design restricts generalization across different categories. Furthermore, the scarcity of high-quality annotations has led to a reliance on inharmonious synthetic data, resulting in poor generalization to real-world scenarios. To this end, we propose a Foreground Consistent Learning model, dubbed FCLM, to address the aforementioned issues. Specifically, we first introduce a Depth-Aware Distillation strategy that transfers depth-related knowledge for better foreground representation. Considering the data dilemma, we cast the processing of synthetic data as a domain adaptation problem and propose a domain-invariant learning strategy that focuses on foreground learning. To support interactive prediction, we contribute an Object-Oriented Decoder that can receive both visual and language prompts to predict the referring target. Experimental results show that our method quantitatively and qualitatively outperforms SOTA methods.
https://arxiv.org/abs/2601.12080
Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address these problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.
https://arxiv.org/abs/2511.07813
Zero-shot object navigation (ZSON) in unseen environments remains a challenging problem for household robots, requiring strong perceptual understanding and decision-making capabilities. While recent methods leverage metric maps and Large Language Models (LLMs), they often depend on depth sensors or prebuilt maps, limiting the spatial reasoning ability of Multimodal Large Language Models (MLLMs). Mapless ZSON approaches have emerged to address this, but they typically make short-sighted decisions, leading to local deadlocks due to a lack of historical context. We propose PanoNav, a fully RGB-only, mapless ZSON framework that integrates a Panoramic Scene Parsing module to unlock the spatial parsing potential of MLLMs from panoramic RGB inputs, and a Memory-guided Decision-Making mechanism enhanced by a Dynamic Bounded Memory Queue to incorporate exploration history and avoid local deadlocks. Experiments on the public navigation benchmark show that PanoNav significantly outperforms representative baselines in both SR and SPL metrics.
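A toy sketch of what a dynamic bounded memory queue feeding exploration history back into the decision prompt might look like; the capacity, entry format, and method names are hypothetical, not PanoNav's interface.

```python
# Minimal sketch of a bounded memory queue for exploration history
# (capacity and summary format are assumptions, not PanoNav's actual API).
from collections import deque


class BoundedMemoryQueue:
    def __init__(self, capacity: int = 8):
        self.buffer = deque(maxlen=capacity)   # oldest entries are dropped automatically

    def push(self, step: int, observation_summary: str, action: str) -> None:
        self.buffer.append(f"step {step}: saw {observation_summary}, chose {action}")

    def as_prompt(self) -> str:
        """Serialize recent history so it can be prepended to the MLLM query."""
        return "\n".join(self.buffer) if self.buffer else "no history yet"


memory = BoundedMemoryQueue(capacity=4)
memory.push(1, "a corridor with two doors", "move forward")
print(memory.as_prompt())
```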
https://arxiv.org/abs/2511.06840
Understanding motion in dynamic environments is critical for autonomous driving, thereby motivating research on class-agnostic motion prediction. In this work, we investigate weakly and self-supervised class-agnostic motion prediction from LiDAR point clouds. Outdoor scenes typically consist of mobile foregrounds and static backgrounds, allowing motion understanding to be associated with scene parsing. Based on this observation, we propose a novel weakly supervised paradigm that replaces motion annotations with fully or partially annotated (1%, 0.1%) foreground/background masks for supervision. To this end, we develop a weakly supervised approach utilizing foreground/background cues to guide the self-supervised learning of motion prediction models. Since foreground motion generally occurs in non-ground regions, non-ground/ground masks can serve as an alternative to foreground/background masks, further reducing annotation effort. Leveraging non-ground/ground cues, we propose two additional approaches: a weakly supervised method requiring fewer (0.01%) foreground/background annotations, and a self-supervised method without annotations. Furthermore, we design a Robust Consistency-aware Chamfer Distance loss that incorporates multi-frame information and robust penalty functions to suppress outliers in self-supervised learning. Experiments show that our weakly and self-supervised models outperform existing self-supervised counterparts, and our weakly supervised models even rival some supervised ones. This demonstrates that our approaches effectively balance annotation effort and performance.
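For intuition, a simplified Chamfer-style loss with a robust (Huber) penalty to suppress outliers is sketched below; the full Robust Consistency-aware Chamfer Distance additionally incorporates multi-frame consistency, which is omitted here, and the threshold is an illustrative choice.

```python
# Sketch of a Chamfer-style self-supervised loss with a robust (Huber) penalty;
# the multi-frame consistency term is omitted and all hyperparameters are illustrative.
import torch


def robust_chamfer(warped: torch.Tensor, target: torch.Tensor, delta: float = 0.5) -> torch.Tensor:
    """warped: (N, 3) points moved by the predicted motion; target: (M, 3) next-frame points."""
    d = torch.cdist(warped, target)                      # (N, M) pairwise distances
    d_fwd, d_bwd = d.min(dim=1).values, d.min(dim=0).values
    huber = lambda r: torch.where(r < delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
    return huber(d_fwd).mean() + huber(d_bwd).mean()
```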
https://arxiv.org/abs/2509.13116
Accurate segmentation of thin structures is critical for microsurgical scene understanding but remains challenging due to resolution loss, low contrast, and class imbalance. We propose Microsurgery Instrument Segmentation for Robotic Assistance (MISRA), a segmentation framework that augments RGB input with luminance channels, integrates skip attention to preserve elongated features, and employs an Iterative Feedback Module (IFM) for continuity restoration across multiple passes. In addition, we introduce a dedicated microsurgical dataset with fine-grained annotations of surgical instruments, including thin objects, providing a benchmark for robust evaluation; the dataset is available at this https URL. Experiments demonstrate that MISRA achieves competitive performance, improving the mean class IoU by 5.37% over competing methods, while delivering more stable predictions at instrument contacts and overlaps. These results position MISRA as a promising step toward reliable scene parsing for computer-assisted and robotic microsurgery.
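A small sketch of the luminance-augmented input described above, assuming a standard ITU-R BT.601 luminance conversion appended as an extra channel; MISRA's exact channel construction may differ.

```python
# Sketch of augmenting an RGB tensor with a luminance channel (ITU-R BT.601
# weights); whether MISRA uses exactly this conversion is an assumption.
import torch


def add_luminance_channel(rgb: torch.Tensor) -> torch.Tensor:
    """rgb: (B, 3, H, W) in [0, 1]; returns (B, 4, H, W) with luminance appended."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return torch.cat([rgb, y], dim=1)
```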
https://arxiv.org/abs/2509.11727
Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes. In this survey, we present a holistic review of recent advances in VSP, covering a wide array of vision tasks, including Video Semantic Segmentation (VSS), Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), as well as Video Tracking and Segmentation (VTS), and Open-Vocabulary Video Segmentation (OVVS). We systematically analyze the evolution from traditional hand-crafted features to modern deep learning paradigms -- spanning from fully convolutional networks to the latest transformer-based architectures -- and assess their effectiveness in capturing both local and global temporal contexts. Furthermore, our review critically discusses the technical challenges, ranging from maintaining temporal consistency to handling complex scene dynamics, and offers a comprehensive comparative study of datasets and evaluation metrics that have shaped current benchmarking standards. By distilling the key contributions and shortcomings of state-of-the-art methodologies, this survey highlights emerging trends and prospective research directions that promise to further elevate the robustness and adaptability of VSP in real-world applications.
https://arxiv.org/abs/2506.13552
RGB-D scene parsing methods effectively capture both semantic and geometric features of the environment, demonstrating great potential under challenging conditions such as extreme weather and low lighting. However, existing RGB-D scene parsing methods predominantly rely on supervised training strategies, which require a large amount of manually annotated pixel-level labels that are both time-consuming and costly. To overcome these limitations, we introduce DepthMatch, a semi-supervised learning framework that is specifically designed for RGB-D scene parsing. To make full use of unlabeled data, we propose complementary patch mix-up augmentation to explore the latent relationships between texture and spatial features in RGB-D image pairs. We also design a lightweight spatial prior injector to replace traditional complex fusion modules, improving the efficiency of heterogeneous feature fusion. Furthermore, we introduce depth-guided boundary loss to enhance the model's boundary prediction capabilities. Experimental results demonstrate that DepthMatch exhibits high applicability in both indoor and outdoor scenes, achieving state-of-the-art results on the NYUv2 dataset and ranking first on the KITTI Semantics benchmark.
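One plausible reading of a depth-guided boundary loss is sketched below: per-pixel cross-entropy is re-weighted by the local depth gradient magnitude so that depth discontinuities, which often coincide with object boundaries, receive more emphasis. The weighting scheme and hyperparameter are assumptions, not DepthMatch's exact formulation.

```python
# Sketch of a depth-guided boundary loss: cross-entropy re-weighted by the
# local depth gradient magnitude (illustrative weighting, not the paper's exact loss).
import torch
import torch.nn.functional as F


def depth_guided_boundary_loss(logits, labels, depth, alpha: float = 4.0):
    """logits: (B, C, H, W); labels: (B, H, W); depth: (B, 1, H, W)."""
    dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]
    dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]
    grad = F.pad(dx.abs(), (0, 1, 0, 0)) + F.pad(dy.abs(), (0, 0, 0, 1))  # (B, 1, H, W)
    weight = 1.0 + alpha * grad / (grad.amax(dim=(2, 3), keepdim=True) + 1e-6)
    ce = F.cross_entropy(logits, labels, reduction="none")                # (B, H, W)
    return (weight.squeeze(1) * ce).mean()
```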
https://arxiv.org/abs/2505.20041
Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) method for text-to-image generation of complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, hierarchical compositional diffusion utilizes a Gaussian mask and filtering to refine bounding-box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
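To illustrate the Gaussian-mask region enhancement, the snippet below builds a soft 2D Gaussian mask centered on a bounding box that could be used to modulate a region during generation; the sigma scaling and the way the mask would be applied to latents are assumptions.

```python
# Sketch of a 2D Gaussian mask centered on a bounding box, used to softly
# emphasize a region during generation; the sigma scaling is an illustrative choice.
import torch


def gaussian_box_mask(h: int, w: int, box: tuple[float, float, float, float]) -> torch.Tensor:
    """box = (x0, y0, x1, y1) in pixels; returns an (h, w) mask peaking inside the box."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx, sy = max(x1 - x0, 1.0) / 2, max(y1 - y0, 1.0) / 2
    ys, xs = torch.meshgrid(torch.arange(h).float(), torch.arange(w).float(), indexing="ij")
    return torch.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2)


mask = gaussian_box_mask(64, 64, (16, 16, 48, 40))   # e.g. modulate latents as latents * (1 + mask)
```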
https://arxiv.org/abs/2505.02648
Recent vision foundation models (VFMs), typically based on Vision Transformer (ViT), have significantly advanced numerous computer vision tasks. Despite their success in tasks focused solely on RGB images, the potential of VFMs in RGB-depth driving scene parsing remains largely under-explored. In this article, we take one step toward this emerging research area by investigating a feasible technique to fully exploit VFMs for generalizable RGB-depth driving scene parsing. Specifically, we explore the inherent characteristics of RGB and depth data, thereby presenting a Heterogeneous Feature Integration Transformer (HFIT). This network enables the efficient extraction and integration of comprehensive heterogeneous features without re-training ViTs. Relative depth predictions from VFMs, used as inputs to the HFIT side adapter, overcome the limitation of relying on depth maps. Our proposed HFIT demonstrates superior performance compared to all other traditional single-modal and data-fusion scene parsing networks, pre-trained VFMs, and ViT adapters on the Cityscapes and KITTI Semantics datasets. We believe this novel strategy paves the way for future innovations in VFM-based data-fusion techniques for driving scene parsing. Our source code is publicly available at this https URL.
https://arxiv.org/abs/2502.06219
Multi-object multi-part scene segmentation is a challenging task whose complexity scales exponentially with part granularity and number of scene objects. To address the task, we propose a plug-and-play approach termed OLAF. First, we augment the input (RGB) with channels containing object-based structural cues (fg/bg mask, boundary edge mask). We propose a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization. In addition, we introduce an encoder module termed LDF to provide low-level dense feature guidance. This assists segmentation, particularly for smaller parts. OLAF enables significant mIoU gains of $\mathbf{3.3}$ (Pascal-Parts-58), $\mathbf{3.5}$ (Pascal-Parts-108) over the SOTA model. On the most challenging variant (Pascal-Parts-201), the gain is $\mathbf{4.0}$. Experimentally, we show that OLAF's broad applicability enables gains across multiple architectures (CNN, U-Net, Transformer) and datasets. The code is available at this http URL
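The weight adaptation idea can be sketched as inflating the pre-trained 3-channel stem convolution to accept the 5-channel input, initializing the extra-channel filters from the mean of the RGB filters; this is a common inflation heuristic given here as an assumption, not OLAF's exact adaptation technique.

```python
# Sketch of adapting a 3-channel pre-trained stem convolution to a 5-channel
# input (RGB + fg/bg mask + boundary edge mask); the mean-of-RGB initialization
# is a common heuristic, not necessarily OLAF's rule.
import torch
import torch.nn as nn


def inflate_stem_conv(conv3: nn.Conv2d, extra_channels: int = 2) -> nn.Conv2d:
    conv5 = nn.Conv2d(conv3.in_channels + extra_channels, conv3.out_channels,
                      conv3.kernel_size, conv3.stride, conv3.padding,
                      bias=conv3.bias is not None)
    with torch.no_grad():
        conv5.weight[:, :3] = conv3.weight                        # copy RGB filters
        mean_w = conv3.weight.mean(dim=1, keepdim=True)            # (out, 1, kH, kW)
        conv5.weight[:, 3:] = mean_w.repeat(1, extra_channels, 1, 1)
        if conv3.bias is not None:
            conv5.bias.copy_(conv3.bias)
    return conv5
```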
https://arxiv.org/abs/2411.02858
Task-specific data-fusion networks have marked considerable achievements in urban scene parsing. Among these networks, our recently proposed RoadFormer successfully extracts heterogeneous features from RGB images and surface normal maps and fuses these features through attention mechanisms, demonstrating compelling efficacy in RGB-Normal road scene parsing. However, its performance significantly deteriorates when handling other types/sources of data or performing more universal, all-category scene parsing tasks. To overcome these limitations, this study introduces RoadFormer+, an efficient, robust, and adaptable model capable of effectively fusing RGB-X data, where ``X'' represents additional types/modalities of data such as depth, thermal, surface normal, and polarization. Specifically, we propose a novel hybrid feature decoupling encoder to extract heterogeneous features and decouple them into global and local components. These decoupled features are then fused through a dual-branch multi-scale heterogeneous feature fusion block, which employs parallel Transformer attention and convolutional neural network modules to merge features across different scales and receptive fields. The fused features are subsequently fed into a decoder to generate the final semantic predictions. Notably, our proposed RoadFormer+ ranks first on the KITTI Road benchmark and achieves state-of-the-art performance in mean intersection over union on the Cityscapes, MFNet, FMB, and ZJU datasets. Moreover, it reduces the number of learnable parameters by 65\% compared to RoadFormer. Our source code will be publicly available at mias.group/RoadFormerPlus.
https://arxiv.org/abs/2407.21631
The existing contrastive learning methods mainly focus on single-grained representation learning, e.g., part-level, object-level, or scene-level representations, thus inevitably neglecting the transferability of representations across other granularity levels. In this paper, we aim to learn multi-grained representations, which can effectively describe the image at various granularity levels, thus improving generalization on extensive downstream tasks. To this end, we propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning. Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast over these correspondences to learn more general unsupervised representations. Without pretraining on a large-scale dataset, our method significantly outperforms existing state-of-the-art methods on extensive downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation, and keypoint detection. Moreover, experimental results support the data-efficient property and excellent representation transferability of our method. The source code and trained weights are available at \url{this https URL}.
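A compact sketch of multi-grained contrast under simplifying assumptions: dense features from two views are pooled to scene-, object-, and part-level grids, and matching grid cells across views serve as positives in an InfoNCE loss. MGC's actual correspondence construction is more delicate than this spatially aligned approximation.

```python
# Sketch of contrasting two views at several granularities; grid sizes and the
# assumption of spatially aligned views are illustrative simplifications.
import torch
import torch.nn.functional as F


def multi_grained_contrast(f1, f2, grids=(1, 2, 4), tau=0.2):
    """f1, f2: (B, D, H, W) dense features of two views of the same images."""
    loss = f1.new_tensor(0.0)
    for g in grids:
        z1 = F.normalize(F.adaptive_avg_pool2d(f1, g).flatten(2), dim=1)  # (B, D, g*g)
        z2 = F.normalize(F.adaptive_avg_pool2d(f2, g).flatten(2), dim=1)
        z1 = z1.permute(0, 2, 1).reshape(-1, f1.shape[1])                 # (B*g*g, D)
        z2 = z2.permute(0, 2, 1).reshape(-1, f1.shape[1])
        logits = z1 @ z2.t() / tau                                        # positives on the diagonal
        targets = torch.arange(len(z1), device=z1.device)
        loss = loss + F.cross_entropy(logits, targets)
    return loss / len(grids)
```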
https://arxiv.org/abs/2407.02014
The third Pixel-level Video Understanding in the Wild (PVUW CVPR 2024) challenge aims to advance the state of the art in video understanding by benchmarking Video Panoptic Segmentation (VPS) and Video Semantic Segmentation (VSS) on the challenging videos and scenes introduced in the large-scale Video Panoptic Segmentation in the Wild (VIPSeg) test set and the large-scale Video Scene Parsing in the Wild (VSPW) test set, respectively. This paper details our research work that achieved 1st place in the PVUW'24 VPS challenge, establishing state-of-the-art results in all metrics, including Video Panoptic Quality (VPQ) and Segmentation and Tracking Quality (STQ). With minor fine-tuning, our approach also achieved 3rd place in the PVUW'24 VSS challenge as ranked by the mIoU (mean intersection over union) metric and 1st place as ranked by the VC16 (16-frame video consistency) metric. Our winning solution stands on the shoulders of a giant foundational vision transformer model (DINOv2 ViT-g) and proven multi-stage Decoupled Video Instance Segmentation (DVIS) frameworks for video understanding.
https://arxiv.org/abs/2406.05352
Radar sensors are low-cost, long-range, and weather-resilient. Therefore, they are widely used for driver assistance functions and are expected to be crucial for the success of autonomous driving in the future. In many perception tasks, only pre-processed radar point clouds are considered. In contrast, radar spectra are a raw form of radar measurements and contain more information than radar point clouds. However, radar spectra are rather difficult to interpret. In this work, we aim to explore the semantic information contained in spectra in the context of automated driving, thereby moving towards better interpretability of radar spectra. To this end, we create a radar spectra-language model, allowing us to query radar spectra measurements for the presence of scene elements using free text. We overcome the scarcity of radar spectra data by matching the embedding space of an existing vision-language model (VLM). Finally, we explore the benefit of the learned representation for scene parsing and obtain improvements in free-space segmentation and object detection merely by injecting the spectra embedding into a baseline model.
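A minimal sketch of matching a spectra encoder to the embedding space of a frozen VLM: each spectrum embedding is pulled toward the VLM embedding of its paired camera image, after which free-text queries encoded by the VLM text tower can be matched against spectra embeddings. The cosine alignment loss and the camera-pairing assumption are illustrative, not the paper's exact objective.

```python
# Sketch of aligning a radar-spectra encoder with a frozen VLM image embedding
# space; encoders are passed in as callables, and the plain cosine loss is an
# illustrative assumption.
import torch
import torch.nn.functional as F


def alignment_loss(spectra_encoder, vlm_image_encoder, spectra, paired_images):
    """Pull each spectrum embedding toward the VLM embedding of its paired camera image."""
    z_spec = F.normalize(spectra_encoder(spectra), dim=-1)
    with torch.no_grad():                                   # the VLM stays frozen
        z_img = F.normalize(vlm_image_encoder(paired_images), dim=-1)
    return (1 - (z_spec * z_img).sum(dim=-1)).mean()        # cosine distance

# At query time, any text prompt encoded by the VLM text tower can be matched
# against spectra embeddings, e.g. similarity = z_spec @ z_text.t().
```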
https://arxiv.org/abs/2406.02158
Pixel-level scene understanding is one of the fundamental problems in computer vision, which aims at recognizing the object classes, masks, and semantics of each pixel in a given image. Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction, because the real world is video-based rather than static. In this paper, we adopt a semi-supervised video semantic segmentation method based on unreliable pseudo labels. We then ensemble the teacher network with the student network to generate pseudo labels and retrain the student network. Our method achieves mIoU scores of 63.71% and 67.83% on the development test and final test, respectively. Finally, our approach obtained 1st place in the Video Scene Parsing in the Wild Challenge at CVPR 2024.
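A schematic of the teacher-student ensembling for pseudo labels, assuming an EMA-updated teacher and simple averaging of softmax outputs with a confidence threshold; the exact ensembling and retraining recipe in the paper may differ.

```python
# Sketch of ensembling teacher and student predictions to form pseudo labels
# for unlabeled frames; EMA momentum and the confidence threshold are illustrative.
import torch


@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.999):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1 - momentum)


@torch.no_grad()
def ensemble_pseudo_labels(teacher, student, images, conf_thresh: float = 0.9):
    probs = (teacher(images).softmax(1) + student(images).softmax(1)) / 2
    conf, labels = probs.max(dim=1)
    labels[conf < conf_thresh] = 255          # ignore index for unreliable pixels
    return labels
```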
https://arxiv.org/abs/2406.00587